forestmaker: Tool for making feature forest models

This is a tool for making event files of feature forest models.

forestmaker model_name grammar_module derivbank event_file
  model_name      name of a probabilistic model (this will be used in parsing)
  grammar_module  LiLFeS program in which a grammar and predicates for event extraction are implemented
  derivbank       derivbank obtained by "lexextract" (lildb format)
  event_file      file to output unfiltered events (text format or compressed (gz or bz) format)
Options
  -r file_name    file to output reference distribution
  -n threshold    limit the number of events to be output
  -v              print debug messages
  -vv             print many debug messages
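
As a rough illustration of the synopsis above, the following Python sketch assembles an argument list for "forestmaker". The helper name forestmaker_command, the file names, and the model name are hypothetical. Placing the options before the positional arguments is an assumption; the synopsis does not state the order.

```python
def forestmaker_command(model_name, grammar_module, derivbank, event_file,
                        reference_file=None, threshold=None, verbosity=0):
    """Build an argument list for forestmaker (hypothetical helper).

    Assumption: options are placed before the positional arguments.
    """
    args = ["forestmaker"]
    if reference_file is not None:
        args += ["-r", reference_file]      # file to output reference distribution
    if threshold is not None:
        args += ["-n", str(threshold)]      # limit the number of events
    if verbosity == 1:
        args.append("-v")                   # debug messages
    elif verbosity >= 2:
        args.append("-vv")                  # many debug messages
    args += [model_name, grammar_module, derivbank, event_file]
    return args


# All names below are placeholders chosen for illustration only.
cmd = forestmaker_command("hpsg-forest", "mygrammar.lil", "derivbank.lildb",
                          "events.gz", reference_file="ref.dist", verbosity=1)
print(" ".join(cmd))
```

The resulting list could be handed to Python's subprocess.run if the tool is installed.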

The name of a probabilistic model must be assigned to each event file. This means that by assigning different names, you can use multiple models in parsing. For example, if you incorporate a unigram model as a reference distribution into a feature forest model, you must assign different names to the models.

This tool supports the construction of a maximum entropy model of a derivation, given a grammar and a derivbank. This tool makes unfiltered events that will be required for the estimation of a probabilistic model.

An unfiltered event is a string that has several fields separated by "//". An example is as follows.

SUBJ//plays//VBZ//[npVPnp]//haag//NNP//[NP]_2//binary

The last field ("binary") denotes the category of this event. The category is used in later steps, for example when applying masks to the events. Events that have the same category name must have the same number of fields, since the same masks are applied to them. This means that you must use different category names for events that have different numbers of fields. For example, events for binary and unary rule applications need different category names, because they are represented with different numbers of fields.
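
The field and category convention can be checked mechanically. The following Python sketch splits individual unfiltered event strings on "//" and reports categories whose events disagree on the number of fields. It handles single event strings only, not the full feature forest format of an event file. The function names and the second sample event are hypothetical.

```python
from collections import defaultdict


def parse_event(line):
    """Split an unfiltered event string into (category, fields)."""
    parts = line.strip().split("//")
    # The last field is the category; the rest are the event's fields.
    return parts[-1], parts[:-1]


def check_field_counts(lines):
    """Return the categories whose events have differing field counts."""
    counts = defaultdict(set)
    for line in lines:
        category, fields = parse_event(line)
        counts[category].add(len(fields))
    return {c: sorted(n) for c, n in counts.items() if len(n) > 1}


events = [
    "SUBJ//plays//VBZ//[npVPnp]//haag//NNP//[NP]_2//binary",
    "HEAD//plays//VBZ//binary",  # hypothetical event with too few fields
]

category, fields = parse_event(events[0])
print(category, len(fields))       # binary 7
print(check_field_counts(events))  # {'binary': [3, 7]}
```

An empty result means every category is internally consistent, so the same mask can be applied to all events of that category.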

Unfiltered events represent the derivation forest of a sentence in a feature forest format. Because model estimation requires derivation forests for all sentences in the training data (i.e., the derivbank), the tool parses all sentences and outputs forests of probabilistic events by extracting a probabilistic event for each node in the derivation forests. Hence, this tool requires implementations of the interfaces for parsing and for extracting probabilistic events from derivations.

First, in order to parse sentences, you must implement the interfaces defined in "UP" (such as id_schema_binary). For details, see "How to use a grammar" and the manual of UP.

In addition, you must implement the following predicates defined in "mayz/forestmake.lil", which substitute for sentence_to_word_lattice/2 and lexical_entry/2.

fm_derivation_to_word_lattice(+$Derivation, -$WordLattice)
  $Derivation   derivation
  $WordLattice  word lattice (list of 'extent')
Make a word lattice from a derivation.

fm_lexical_entry(+$Lex, -$LexName)
  $Lex      input word and the name of a template that will be assigned to the word (lex_entry)
  $LexName  LEX_NAME (the second argument of 'lexical_entry/2')
Provide lexical entries that are assigned to a word.

The above predicates may be implemented like "sentence_to_word_lattice/2" and "lexical_entry/2". However, their inputs carry information about the correct derivation, and this information can be exploited. For example, since "fm_lexical_entry/2" is given the name of the correct lexical entry, we can cut off lexical entries with low probabilities by returning only the correct lexical entry and other lexical entries with high probabilities. This technique greatly reduces the time for parsing training sentences, and hence for making an event file. Note that the correct lexical entries must be included in the assigned lexical entries, because a derivation forest must include the correct derivation tree.
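
The pruning idea can be sketched as follows. This is a Python illustration of the selection logic only, not a LiLFeS implementation of "fm_lexical_entry/2". The function name, the probability table, and the threshold value are assumptions made for the example.

```python
def select_lexical_entries(candidates, correct, min_prob=0.05):
    """Choose the lexical entries to assign to one word.

    candidates: dict mapping a lexical entry name to its probability
                (assumed to come from some hypothetical lexical model).
    correct:    name of the correct lexical entry, always kept.
    min_prob:   hypothetical cut-off for the other entries.
    """
    # Keep the other entries whose probability reaches the cut-off,
    # ordered from most to least probable.
    others = [name
              for name, prob in sorted(candidates.items(),
                                       key=lambda item: -item[1])
              if prob >= min_prob and name != correct]
    # The correct entry is included even when its probability is low,
    # so that the forest still contains the correct derivation tree.
    return [correct] + others


# Hypothetical probabilities for the candidate entries of one word.
candidates = {"[npVPnp]": 0.004, "[npVP]": 0.60, "[NP]": 0.39, "[ADJ]": 0.006}
selected = select_lexical_entries(candidates, correct="[npVPnp]")
print(selected)  # ['[npVPnp]', '[npVP]', '[NP]']
```

In this sketch the low-probability entry "[ADJ]" is dropped, while "[npVPnp]" survives despite its low probability because it is the correct entry.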

The following predicate must also be implemented to make correct derivation trees. While derivations in a derivbank are used for making correct derivation trees, the following predicate is necessary for providing the lexical entries corresponding to terminal nodes.

fm_correct_lexical_entry(+$Term, -$LexName)
  $Term     terminal node of a derivation (derivation_terminal)
  $LexName  LEX_NAME (the second argument of lexical_entry/2)
Returns the correct lexical entry corresponding to a terminal node of a derivation.

Next, the following interfaces defined in "mayz/amismodel.lil" are required for extracting probabilistic events. They extract an event from each node in a derivation forest. An event is represented as a list of strings. "forestmaker" calls these predicates for each node in a derivation forest, and the results are output into an event file in a feature forest format.

extract_terminal_event(+$ModelName, -$Category, +$LexName, +$Sign, +$SignPlus, -$Event)
  $ModelName  name of a probabilistic model
  $Category   name of a category
  $LexName    LEX_NAME (the second argument of "lexical_entry/2")
  $Sign       lexical entry
  $SignPlus   SIGN_PLUS (the third argument of "reduce_sign/3")
  $Event      event (a list of strings)
Extract an event of a terminal node.

extract_unary_event(+$ModelName, -$Category, +$SchemaName, +$Dtr, +$Mother, +$SignPlus, -$Event)
  $ModelName   name of a probabilistic model
  $Category    name of a category
  $SchemaName  name of a schema
  $Dtr         daughter sign
  $Mother      mother sign
  $SignPlus    SIGN_PLUS (the third argument of "reduce_sign/3")
  $Event       event (a list of strings)
Extract an event of a unary rule application.

extract_binary_event(+$ModelName, -$Category, +$SchemaName, +$LeftDtr, +$RightDtr, +$Mother, +$SignPlus, -$Event)
  $ModelName   name of a probabilistic model
  $Category    name of a category
  $SchemaName  name of a schema
  $LeftDtr     sign of a left daughter
  $RightDtr    sign of a right daughter
  $Mother      sign of a mother
  $SignPlus    SIGN_PLUS (the third argument of "reduce_sign/3")
  $Event       event (a list of strings)
Extract an event of a binary rule application.

extract_root_event(+$ModelName, -$Category, +$Sign, -$Event)
  $ModelName  name of a probabilistic model
  $Category   name of a category
  $Sign       sign of a root node
  $Event      event (a list of strings)
Extract an event of a root node.

The name of a probabilistic model must be the same as the first command-line argument of "forestmaker".

For each interface, we also provide a version in which the value of a feature function (integer or float) can be specified. Add the feature value as the last argument.

"forestmaker" allows for the development of an event file with a reference distribution. Specify the file name of a reference distribution in the "-r" option, and implement the following interfaces.

reference_prob_terminal(+$ModelName, +$LexName, +$Sign, +$SignPlus, -$Prob)
  $ModelName  name of a probabilistic model
  $LexName    LEX_NAME (the second argument of "lexical_entry/2")
  $Sign       sign of a terminal node
  $SignPlus   SIGN_PLUS (the third argument of "reduce_sign/3")
  $Prob       reference probability of a terminal node
Returns a reference probability of a terminal node.

reference_prob_unary(+$ModelName, +$SchemaName, +$Dtr, +$Mother, +$SignPlus, -$Prob)
  $ModelName   name of a probabilistic model
  $SchemaName  name of a schema
  $Dtr         daughter sign
  $Mother      mother sign
  $SignPlus    SIGN_PLUS (the third argument of "reduce_sign/3")
  $Prob        reference probability
Returns a reference probability of a unary rule application.

reference_prob_binary(+$ModelName, +$SchemaName, +$LeftDtr, +$RightDtr, +$Mother, +$SignPlus, -$Prob)
  $ModelName   name of a probabilistic model
  $SchemaName  name of a schema
  $LeftDtr     sign of a left daughter
  $RightDtr    sign of a right daughter
  $Mother      sign of a mother
  $SignPlus    SIGN_PLUS (the third argument of "reduce_sign/3")
  $Prob        reference probability
Returns a reference probability of a binary rule application.

reference_prob_root(+$ModelName, +$Sign, -$Prob)
  $ModelName  name of a probabilistic model
  $Sign       sign of a root node
  $Prob       reference probability
Returns a reference probability of a root node.

MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)