lexextract: Tool for making derivation trees and a lexicon


This tool is for making derivation trees and lexical entries from parse trees.

lexextract [options] lexextract_module treebank derivbank lexicon template lexbank

lexextract_module: lilfes program in which inverse schemas and inverse lexical rules are defined
treebank: input treebank (lildb format)
derivbank: file to output a derivbank (lildb format)
lexicon: file to output a lexicon (lildb format)
template: file to output lexical entry templates (lildb format)
lexbank: file to output derivation terminals (lildb format)

Options

-v: print debug messages
-vv: print many debug messages
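
The order of the positional arguments is easy to get wrong, so the following minimal sketch assembles the command line from named parameters. The helper "build_lexextract_command" and all file names are hypothetical and used only for illustration; only the argument order and the "-v"/"-vv" options come from the synopsis above.

```python
from typing import List


def build_lexextract_command(module: str, treebank: str, derivbank: str,
                             lexicon: str, template: str, lexbank: str,
                             verbosity: int = 0) -> List[str]:
    """Return the argument list for one lexextract run (hypothetical helper)."""
    if verbosity not in (0, 1, 2):
        raise ValueError("verbosity must be 0, 1 (-v) or 2 (-vv)")
    # Options come first, followed by the six positional arguments in order.
    options = [] if verbosity == 0 else ["-" + "v" * verbosity]
    return ["lexextract", *options,
            module, treebank, derivbank, lexicon, template, lexbank]


# All file names below are made up for the example.
command = build_lexextract_command(
    module="mygrammar.lil",
    treebank="treebank.lildb",
    derivbank="derivbank.lildb",
    lexicon="lexicon.lildb",
    template="template.lildb",
    lexbank="lexbank.lildb",
    verbosity=1,
)
print(" ".join(command))
```

The resulting list can be passed to a process launcher such as Python's subprocess.run once the tool is installed.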

This tool converts a treebank made by "treetrans" into a derivbank (derivation trees of the target grammar theory). It also extracts lexical entries from the derivbank.

First, write inverse schemas with the following interfaces, in order to convert parse trees in an input treebank into derivation trees. The inverse schemas are applied in the following order.

  1. Apply "root_constraints/1" to the value of "TREE_NODE\NODE_SIGN\" of the root node of a parse tree.
    root_constraints(-$Sign)
    $Sign: sign of the root node
    Unify $Sign with the sign of the root of the derivation tree.
  2. Apply "inverse_schema_binary/4" or "inverse_schema_unary/3" to "TREE_NODE\NODE_SIGN\" of each node, working top-down. The value of "TREE_NODE\SCHEMA_NAME\" is used as the name of the schema. Schemas are applied in depth-first order.
    inverse_schema_binary(+$SchemaName, +$Mother, -$Left, -$Right)
    $SchemaName: schema name
    $Mother: sign of the mother
    $Left: sign of the left daughter
    $Right: sign of the right daughter
    Apply a binary schema to $Mother and obtain daughter signs.
    inverse_schema_unary(+$SchemaName, +$Mother, -$Dtr)
    $SchemaName: schema name
    $Mother: sign of the mother
    $Dtr: sign of the daughter
    Apply a unary schema to $Mother, and obtain a daughter sign.
  3. After applying inverse schemas to all internal nodes, apply "lexical_constraints/2" to "TREE_NODE\NODE_SIGN\" of each terminal node. Because this happens after all inverse schemas have been applied, you can use this interface to impose default constraints.
    lexical_constraints(+$Word, -$Sign)
    $Word: feature structure representing a word (the value of "TREE_NODE\WORD\")
    $Sign: sign of a terminal node
    Unify $Sign with the sign of a terminal node.
A derivation tree made by the above process is represented with a feature structure defined in "derivtypes.lil". A list of terminal nodes is stored in "lexbank".
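
The order of application described above can be sketched as follows. This is a simplified Python model of the control flow only, not the tool's implementation: signs are plain dictionaries, unification is approximated by dictionary updates, and the "Node" class, the "convert" function and the toy schemas ("subj_head", "bare_vp") are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Sign = Dict[str, str]  # stand-in for a feature structure


@dataclass
class Node:
    schema: Optional[str] = None   # schema name; None on terminal nodes
    word: Optional[str] = None     # word; set only on terminal nodes
    daughters: List["Node"] = field(default_factory=list)
    sign: Sign = field(default_factory=dict)


def convert(root, root_constraints, inverse_binary, inverse_unary,
            lexical_constraints):
    terminals = []
    root_constraints(root.sign)                  # step 1: root node only

    def visit(node):                             # step 2: top-down, depth-first
        if not node.daughters:
            terminals.append(node)
            return
        if len(node.daughters) == 2:
            left, right = inverse_binary(node.schema, node.sign)
            node.daughters[0].sign.update(left)
            node.daughters[1].sign.update(right)
        elif len(node.daughters) == 1:
            node.daughters[0].sign.update(inverse_unary(node.schema, node.sign))
        else:
            raise ValueError("only unary and binary nodes are supported")
        for daughter in node.daughters:
            visit(daughter)

    visit(root)
    for terminal in terminals:                   # step 3: after all schemas
        lexical_constraints(terminal.word, terminal.sign)
    return terminals


# Toy grammar, invented for illustration.
def root_constraints(sign):
    sign["cat"] = "S"

def inverse_binary(schema, mother):
    if schema == "subj_head":
        return {"cat": "NP"}, {"cat": "VP", "subj": "NP"}
    raise ValueError(schema)

def inverse_unary(schema, mother):
    return {"cat": "V", "subj": mother["subj"]}

def lexical_constraints(word, sign):
    sign.setdefault("form", word)                # default filled in last

tree = Node(schema="subj_head", daughters=[
    Node(word="she"),
    Node(schema="bare_vp", daughters=[Node(word="runs")]),
])
terminals = convert(tree, root_constraints, inverse_binary, inverse_unary,
                    lexical_constraints)
for terminal in terminals:
    print(terminal.word, terminal.sign)
```

Because the lexical constraints run only after the whole tree has been visited, the default set by "setdefault" never overrides a value that an inverse schema has already supplied.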

Next, extract lexical entry templates, and mappings from words to lexical entry templates, from the terminal nodes of the derivation trees. Interfaces for lexicon extraction are defined in "lexextract.lil". The extraction algorithm is presented below. In each of the following steps, the target feature structures are copied. This means that even when you modify the target feature structures with new constraints or destructive operations, the modifications affect neither the derivation trees nor other lexical entries.

  1. Apply "lexical_entry_template/3" to "DERIV_SIGN\" of each terminal node of a derivation tree. The result is stored in "LEXENTRY_SIGN\" of the derivation tree.
    lexical_entry_template(+$Word, +$Sign, -$Template)
    $Word: feature structure representing a word
    $Sign: lexical sign
    $Template: lexical entry template
    Make a lexical entry template $Template from lexical sign $Sign of the word $Word.
  2. Apply "reduce_lexical_template/5" to the lexical entry template, and obtain a key to look up a lexicon, a sign of a lexeme, and a history of lexical rule applications. The obtained lexeme will be stored in a template database. Lexeme signs are also stored in "LEXEME_SIGN\" of derivation trees.
    reduce_lexical_template(+$Word, +$InTemplate, -$Key, -$OutTemplate, -$LexRules)
    $Word: feature structure representing a word
    $InTemplate: input lexical entry template (the output of "lexical_entry_template/3")
    $Key: key to look up a lexicon
    $OutTemplate: sign of a lexeme
    $LexRules: a list of applied lexical rules
    Obtain the sign of a lexeme by applying lexical rules in reverse to the lexical entry template obtained by "lexical_entry_template/3".
  3. If the lexeme sign is not yet stored in the database, i.e., it is seen for the first time, apply "lexeme_name/4" to the lexeme sign to obtain the name of the lexeme. The pair of this name and the history of lexical rule applications becomes the name of the lexical entry template. A mapping from the key to the lexical entry template is stored in a lexicon database. Template names are stored in "TERM_TEMPLATE\" of derivation trees.
    lexeme_name(+$Word, +$Template, +$ID, -$Name)
    $Word: feature structure representing a word
    $Template: sign of a lexeme
    $ID: identification number (integer)
    $Name: name of a lexeme (string)
    Assign a unique name to a lexeme.
  4. Increment the occurrence count of a word. Occurrence counts will be used for cutting off infrequent words in "lexrefine".
    word_count_key(+$LexKey, -$CountKey)
    $LexKey: key to look up a lexicon
    $CountKey: key to be used for counting a word
    Obtain a key to count the occurrence of a word. If you want to count different keys as an identical word, implement this predicate to return the same $CountKey for the different keys.
Finally, a lexicon and a template database are stored in files.
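
The four extraction steps can be sketched as a single loop. This is a simplified Python model under stated assumptions, not the tool's code: signs are flat dictionaries, the databases are in-memory dictionaries, the identification number passed to "lexeme_name" is taken to be the current size of the template database, and the toy callbacks (including the "3sg_rule" inverse lexical rule) are invented for the example.

```python
import copy
from collections import Counter


def extract_lexicon(terminals, lexical_entry_template, reduce_lexical_template,
                    lexeme_name, word_count_key):
    templates = {}      # lexeme sign (frozen) -> lexeme name
    lexicon = {}        # lookup key -> set of template names
    counts = Counter()  # count key -> number of occurrences
    for term in terminals:
        word = term["word"]
        sign = copy.deepcopy(term["sign"])        # work on a copy of the sign
        template = lexical_entry_template(word, sign)                 # step 1
        key, lexeme, rules = reduce_lexical_template(word, template)  # step 2
        frozen = tuple(sorted(lexeme.items()))
        if frozen not in templates:                                   # step 3
            templates[frozen] = lexeme_name(word, lexeme, len(templates))
        name = (templates[frozen], tuple(rules))  # lexeme name + rule history
        lexicon.setdefault(key, set()).add(name)
        counts[word_count_key(key)] += 1                              # step 4
    return lexicon, templates, counts


# Toy callbacks, invented for illustration.
def lexical_entry_template(word, sign):
    return dict(sign)

def reduce_lexical_template(word, template):
    if template.get("form") == "3sg":             # undo a toy inflection rule
        return word.rstrip("s"), dict(template, form="base"), ["3sg_rule"]
    return word, template, []

def lexeme_name(word, lexeme, ident):
    return f"{lexeme['cat']}_{ident}"

def word_count_key(key):
    return key.lower()                            # count "Run" and "run" together

terminals = [
    {"word": "runs", "sign": {"cat": "V", "form": "3sg"}},
    {"word": "Run", "sign": {"cat": "V", "form": "base"}},
]
lexicon, templates, counts = extract_lexicon(
    terminals, lexical_entry_template, reduce_lexical_template,
    lexeme_name, word_count_key)
print(lexicon)
print(counts)
```

In this toy run both words reduce to the same lexeme, so the template database holds a single entry while the lexicon holds two keys, and the lowercasing "word_count_key" merges the two keys into one occurrence count.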
MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)