forestmaker: Tool for making feature forest models

This is a tool for making event files of feature forest models.

forestmaker model_name grammar_module derivbank event_file
model_name	name of a probabilistic model (this will be used in parsing)
grammar_module	lilfes program in which a grammar and predicates for event extraction are implemented
derivbank	derivbank obtained by "lexextract" (lildb format)
event_file	file to output unfiltered events (text format or compressed (gz or bz) format)
Options
-r file_name	file to output reference distribution
-n threshold	limit number of events to be output
-v	print debug messages
-vv	print many debug messages
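
As an illustration of the usage line above, the following minimal sketch assembles a "forestmaker" argument list in Python. All file and model names are hypothetical, and placing the options before the positional arguments is an assumption; the usage line does not state where options go.

```python
from typing import List, Optional


def forestmaker_command(
    model_name: str,
    grammar_module: str,
    derivbank: str,
    event_file: str,
    reference_file: Optional[str] = None,
    event_limit: Optional[int] = None,
    verbose: bool = False,
) -> List[str]:
    """Build an argument list for forestmaker from the documented arguments."""
    command = ["forestmaker"]
    # Assumption: options are placed before the positional arguments.
    if reference_file is not None:
        command += ["-r", reference_file]
    if event_limit is not None:
        command += ["-n", str(event_limit)]
    if verbose:
        command.append("-v")
    command += [model_name, grammar_module, derivbank, event_file]
    return command


# Hypothetical names, chosen only for illustration.
example = forestmaker_command(
    "mymodel", "mygrammar.lil", "train.lildb", "train.event.gz",
    reference_file="train.ref.gz", verbose=True,
)
print(" ".join(example))
```

The resulting list can be passed to a process launcher such as Python's "subprocess.run".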

The name of a probabilistic model must be assigned to each event file. This means that by assigning different names, you can use multiple models in parsing. For example, if you incorporate a unigram model as a reference distribution into a feature forest model, you must assign different names to the models.

This tool supports the construction of a maximum entropy model of a derivation, given a grammar and a derivbank. This tool makes unfiltered events that will be required for the estimation of a probabilistic model.

An unfiltered event is a string that has several fields separated by "//". An example is as follows.

SUBJ//plays//VBZ//[npVPnp]//haag//NNP//[NP]_2//binary

The last field ("binary") denotes the category of this event. Categories are used in later steps, for example when applying masks to the events. Events that have the same category name must have the same number of fields, since the same masks are applied to them. Conversely, events that have different numbers of fields must use different category names. For example, binary and unary rule applications must be given different category names, because they are represented with different numbers of fields.
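
The constraint on categories can be checked mechanically. The following is a minimal sketch in Python, not part of the toolkit: it splits events on "//", treats the last field as the category, and reports categories whose events disagree on the number of fields. The function names are hypothetical, and the sketch assumes that no field itself contains "//".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

FIELD_SEPARATOR = "//"


def split_event(event: str) -> Tuple[List[str], str]:
    """Split an unfiltered event into (all fields, category)."""
    fields = event.split(FIELD_SEPARATOR)
    # The last field is the category of the event.
    return fields, fields[-1]


def inconsistent_categories(events: Iterable[str]) -> Dict[str, Set[int]]:
    """Return the categories whose events differ in their number of fields."""
    counts: Dict[str, Set[int]] = defaultdict(set)
    for event in events:
        fields, category = split_event(event)
        counts[category].add(len(fields))
    return {cat: sizes for cat, sizes in counts.items() if len(sizes) > 1}


fields, category = split_event(
    "SUBJ//plays//VBZ//[npVPnp]//haag//NNP//[NP]_2//binary"
)
print(len(fields), category)
```

An empty result from "inconsistent_categories" means that every category can safely share one set of masks.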

The unfiltered events for a sentence represent its derivation forest in a feature forest format. Because model estimation requires derivation forests for all sentences in the training data (i.e., the derivbank), the tool parses all sentences and outputs forests of probabilistic events, extracting a probabilistic event for each node in the derivation forests. Hence, this tool requires implementations of the interfaces for parsing and for extracting probabilistic events from derivations.

First, in order to parse sentences, you must implement the interfaces defined in "UP" (such as id_schema_binary). For details, see "How to use a grammar" and the manual of UP.

In addition, you must implement the following predicates defined in "mayz/forestmake.lil", which substitute for sentence_to_word_lattice/2 and lexical_entry/2.

`fm_derivation_to_word_lattice(+$Derivation, -$WordLattice)`
$Derivation	derivation
$WordLattice	word lattice (list of 'extent')
Make a word lattice from a derivation.

`fm_lexical_entry(+$Lex, -$LexName)`
$Lex	input word and the name of a template that will be assigned to the word (lex_entry)
$LexName	LEX_NAME (the second argument of 'lexical_entry/2')
Provide lexical entries that are assigned to a word.

The above predicates may be implemented in the same way as "sentence_to_word_lattice/2" and "lexical_entry/2". However, their inputs carry information about the correct derivation, and this information can be exploited. For example, since the input of "fm_lexical_entry/2" includes the name of the correct lexical entry, we can cut off lexical entries with low probabilities by returning only the correct lexical entry and the other lexical entries with high probabilities. This technique greatly reduces the time for parsing training sentences, and hence for making an event file. Note that the correct lexical entries must always be included in the assigned lexical entries, because a derivation forest must include the correct derivation tree.
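
The pruning idea can be sketched as follows. This is an illustration in Python rather than LiLFeS, and it is not the toolkit's implementation: the function name, the template names, and the probability threshold are all hypothetical. It keeps every candidate whose probability reaches the threshold and always adds the correct lexical entry.

```python
from typing import Dict, List


def select_lexical_entries(
    candidates: Dict[str, float],
    correct: str,
    threshold: float,
) -> List[str]:
    """Keep high-probability lexical entries plus the correct one."""
    selected = [name for name, prob in candidates.items() if prob >= threshold]
    # The correct entry must survive pruning; otherwise the derivation
    # forest could not contain the correct derivation tree.
    if correct not in selected:
        selected.append(correct)
    return selected


# Hypothetical template names and probabilities.
candidates = {"tmpl_a": 0.6, "tmpl_b": 0.3, "tmpl_c": 0.01, "tmpl_d": 0.002}
print(select_lexical_entries(candidates, correct="tmpl_c", threshold=0.1))
```

Here "tmpl_d" is dropped because its probability is below the threshold, while "tmpl_c" is kept despite its low probability because it is the correct entry.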

The following predicate must also be implemented to make correct derivation trees. Although the derivations in a derivbank are used for making correct derivation trees, this predicate is necessary to provide the lexical entries corresponding to terminal nodes.

`fm_correct_lexical_entry(+$Term, -$LexName)`
$Term	terminal node of a derivation (derivation_terminal)
$LexName	LEX_NAME (the second argument of lexical_entry/2)
Returns the correct lexical entry corresponding to a terminal node of a derivation.

Next, the following interfaces defined in "mayz/amismodel.lil" are required for extracting probabilistic events. They extract an event from each node in a derivation forest. An event is represented as a list of strings. "forestmaker" calls these predicates for each node in a derivation forest, and the results are output into an event file in a feature forest format.

`extract_terminal_event(+$ModelName, -$Category, +$LexName, +$Sign, +$SignPlus, -$Event)`
$ModelName	name of a probabilistic model
$Category	name of a category
$LexName	LEX_NAME (the second argument of "lexical_entry/2")
$Sign	lexical entry
$SignPlus	SIGN_PLUS (the third argument of "reduce_sign/3")
$Event	event (a list of strings)
Extract an event of a terminal node.

`extract_unary_event(+$ModelName, -$Category, +$SchemaName, +$Dtr, +$Mother, +$SignPlus, -$Event)`
$ModelName	name of a probabilistic model
$Category	name of a category
$SchemaName	name of a schema
$Dtr	daughter sign
$Mother	mother sign
$SignPlus	SIGN_PLUS (the third argument of "reduce_sign/3")
$Event	event (a list of strings)
Extract an event of a unary rule application.

`extract_binary_event(+$ModelName, -$Category, +$SchemaName, +$LeftDtr, +$RightDtr, +$Mother, +$SignPlus, -$Event)`
$ModelName	name of a probabilistic model
$Category	name of a category
$SchemaName	name of a schema
$LeftDtr	sign of a left daughter
$RightDtr	sign of a right daughter
$Mother	sign of a mother
$SignPlus	SIGN_PLUS (the third argument of "reduce_sign/3")
$Event	event (a list of strings)
Extract an event of a binary rule application.

`extract_root_event(+$ModelName, -$Category, +$Sign, -$Event)`
$ModelName	name of a probabilistic model
$Category	name of a category
$Sign	sign of a root node
$Event	event (a list of strings)
Extract an event of a root node.

The name of a probabilistic model must be the same as the first command-line argument of "forestmaker".

For each interface, we also provide a version in which the value of a feature function (integer or float) can be specified. Add the feature value as the last argument.

extract_terminal_event_feature_value/7
extract_unary_event_feature_value/8
extract_binary_event_feature_value/9
extract_root_event_feature_value/5

"forestmaker" allows for the development of an event file with a reference distribution. Specify the file name of a reference distribution in the "-r" option, and implement the following interfaces.

`reference_prob_terminal(+$ModelName, +$LexName, +$Sign, +$SignPlus, -$Prob)`
$ModelName	name of a probabilistic model
$LexName	LEX_NAME (the second argument of "lexical_entry/2")
$Sign	sign of a terminal node
$SignPlus	SIGN_PLUS (the third argument of "reduce_sign/3")
$Prob	reference probability of a terminal node
Returns a reference probability of a terminal node.

`reference_prob_unary(+$ModelName, +$SchemaName, +$Dtr, +$Mother, +$SignPlus, -$Prob)`
$ModelName	name of a probabilistic model
$SchemaName	name of a schema
$Dtr	daughter sign
$Mother	mother sign
$SignPlus	SIGN_PLUS (the third argument of "reduce_sign/3")
$Prob	reference probability
Returns a reference probability of a unary rule application.

`reference_prob_binary(+$ModelName, +$SchemaName, +$LeftDtr, +$RightDtr, +$Mother, +$SignPlus, -$Prob)`
$ModelName	name of a probabilistic model
$SchemaName	name of a schema
$LeftDtr	sign of a left daughter
$RightDtr	sign of a right daughter
$Mother	sign of a mother
$SignPlus	SIGN_PLUS (the third argument of "reduce_sign/3")
$Prob	reference probability
Returns a reference probability of a binary rule application.

`reference_prob_root(+$ModelName, +$Sign, -$Prob)`
$ModelName	name of a probabilistic model
$Sign	sign of a root node
$Prob	reference probability
Returns a reference probability of a root node.
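
To show what a reference distribution contributes, the following minimal sketch scores complete derivation trees in Python. It assumes the usual log-linear form with a reference distribution, in which the reference probabilities of the nodes in a tree are multiplied together and combined with the exponentiated feature weights. This document does not spell out how the estimator combines them, so the formula, the data layout, and all names below are assumptions.

```python
import math
from typing import Dict, List, Tuple

# A node is (reference probability, names of the features fired at the node).
Node = Tuple[float, List[str]]


def tree_score(tree: List[Node], weights: Dict[str, float]) -> float:
    """Unnormalized score: product over nodes of q(node) * exp(sum of weights).

    Assumes every reference probability is strictly positive.
    """
    log_score = 0.0
    for ref_prob, features in tree:
        log_score += math.log(ref_prob)
        log_score += sum(weights.get(name, 0.0) for name in features)
    return math.exp(log_score)


def tree_probabilities(
    trees: List[List[Node]], weights: Dict[str, float]
) -> List[float]:
    """Normalize the tree scores over all candidate trees of a sentence."""
    scores = [tree_score(tree, weights) for tree in trees]
    total = sum(scores)
    return [score / total for score in scores]


# Two hypothetical trees, each with two nodes.
trees = [
    [(0.5, ["f1"]), (0.5, [])],
    [(0.25, []), (1.0, [])],
]
print(tree_probabilities(trees, {}))
print(tree_probabilities(trees, {"f1": math.log(3.0)}))
```

With all weights at zero the model reduces to the normalized reference distribution, which is why a reference distribution can serve as a starting point that the feature weights then adjust.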


MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)