Overview of grammar development

Making a grammar

To make a grammar, we use treetrans, lexextract, and lexrefine. The process of grammar development is shown in the following figure.

[Figure: The process of grammar development]

  1. Prepare a source treebank

    First, an input treebank is prepared. A treebank annotated with traces and predicate-argument structures, such as the Penn Treebank, is best for developing lexicalized grammars. However, if you do not have such a treebank, you can add the necessary information to the corpus by using heuristic pattern rules. In an extreme case, if you have only raw text, you can start by annotating it with existing shallow parsers.

  2. Transform input trees and add annotations

    Using treetrans, we transform input trees and annotate them with additional information. A tree is represented as a feature structure, and transformation rules are written in LiLFeS. Several interfaces are provided to make rule writing easier, and convenience tools are also available, for example for binarizing trees and for detecting heads. (A schematic rule is sketched after this list.)

  3. Extract a grammar

    Using lexextract, we make a derivbank, a lexbank, a lexicon, and templates. Once the grammar rules of the target grammar formalism are implemented, the tool automatically extracts a lexicon from the transformed treebank.

    First, to make a derivbank from a transformed treebank, we implement inverse grammar rules (schemas and principles). Next, we implement an interface that extracts a lexical entry template from the terminal feature structure of a derivation tree. Additionally, by implementing inverse lexical rules, we can extract templates of lexemes in which lexical transformations, such as inflection, are abstracted away. (A sketch of these interfaces is given after this list.)

  4. Refine the grammar

    Using lexrefine, we refine the lexicon and the templates. Because of mistakes in the tree transformation, the lexicon and templates obtained above may include erroneous lexical entries. These are cleaned up by filtering out infrequent lexical entry templates. In addition, although the original lexicon does not include lexical entries for unknown words, we can create unknown-word entries by treating infrequent words in the treebank as unknown words. Furthermore, at this stage we can apply lexical rules to the templates, which saves the cost of applying them during parsing.
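
The following is a minimal sketch of the kind of transformation rule mentioned in step 2. The rule predicate (mark_vp_head) and the tree features (NODE_SYM\, DTRS\, HEAD_MARK\) are hypothetical names chosen for illustration, not the actual treetrans interface:

    %% Hypothetical LiLFeS transformation rule: mark the verb daughter
    %% of a VP node as its head.  All names here are assumptions.
    mark_vp_head($Tree) :-
        $Tree  = (NODE_SYM\"VP" & DTRS\[$First|$Rest]),  % match a VP node
        $First = NODE_SYM\"VB",                          % first daughter is a verb
        $First = HEAD_MARK\head.                         % mark it as the head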
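
For step 3, the sketch below illustrates an inverse grammar rule and lexical entry template extraction. The schema name, argument order, and the HPSG-like features (SYNSEM\, SUBJ\, HEAD\, PHON\) are assumptions for illustration, not the exact lexextract interface:

    %% Sketch: invert a hypothetical subject-head schema.  The forward
    %% schema builds $Mother from a subject ($Left) and a head ($Right);
    %% the inverse rule recovers both daughters from the mother sign.
    inverse_schema_binary("subject_head", $Mother, $Left, $Right) :-
        $Mother = SYNSEM\(SUBJ\[] & HEAD\$Head),
        $Right  = SYNSEM\(SUBJ\[$Subj] & HEAD\$Head),    % head daughter
        $Left   = SYNSEM\$Subj.                          % subject daughter

    %% Sketch: extract a lexical entry template from a terminal sign by
    %% abstracting away the surface form of the word.
    lexical_entry_template($Word, $Sign, $Template) :-
        $Sign     = (PHON\$Surface & SYNSEM\$Synsem),
        $Template = (PHON\abstract_phon & SYNSEM\$Synsem).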

Making a probabilistic model

To make a probabilistic disambiguation model, we use unimaker, forestmaker, amisfilter, and an estimator for maximum entropy models, amis (version 3.0 or higher). The flow of making a model is as follows.

[Figure: The process of making a probabilistic model]

To make a unigram model

A unigram model assigns the probability of selecting a lexical entry for a word. Given a word w and a lexical entry l, the model gives the probability p(l|w). This model can be estimated easily because the estimation requires only the cooccurrence counts of words and lexical entries. However, it achieves relatively low accuracy because it cannot capture preferences in the application of grammar rules. This model is useful for testing a grammar/parser, or for reducing the cost of making a feature forest model, which is described later.
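
Since amis is an estimator for maximum entropy models, the resulting unigram model has the standard log-linear form, in which the features f_i come from the masks described below and the weights λ_i are the parameters estimated by amis:

    p(l|w) = exp( Σ_i λ_i f_i(l, w) ) / Z(w)
    Z(w)   = Σ_{l'} exp( Σ_i λ_i f_i(l', w) )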

To make a unigram model, we implement extract_lexical_event/4 and feature_mask/3, defined in "mayz/amismodel.lil". The former extracts an event from a word and a lexical entry. The latter defines the masks applied to each event; applying the masks to events yields the features of the maximum entropy model.
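
The sketch below illustrates these two predicates. The argument order, the model name "unimodel", and the word features (SURFACE\, POS\) are assumptions for illustration; consult "mayz/amismodel.lil" for the actual definitions. Here an event is a list of strings, and a mask is a list of 1/0 flags that selects which fields of an event form a feature:

    %% Sketch: extract an event (a list of strings) from a word and the
    %% name of its lexical entry template.
    extract_lexical_event("unimodel", $Word, $LexName, $Event) :-
        $Word  = (SURFACE\$Surf & POS\$POS),   % assumed word features
        $Event = [$Surf, $POS, $LexName].

    %% Sketch: two masks, one pairing the surface form with the template,
    %% the other pairing the POS with the template.
    feature_mask("unimodel", "surf_tmpl", [1, 0, 1]).
    feature_mask("unimodel", "pos_tmpl",  [0, 1, 1]).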

First, using unimaker, events are extracted into an initial event file. Next, using amisfilter, amis-style data files are created by applying masks to the initial event file. Finally, using "amis", parameters of the model are estimated.

To make a feature forest model

A feature forest model gives the probability of a derivation for a given sentence. Given a sentence s and its derivation d, the model gives p(d|s). The estimation of this model is computationally expensive because all training sentences must be parsed and the event data become huge. However, various preferences, such as preferences of grammar rules and bi-lexical dependencies, can be incorporated into the model, so high accuracy can be expected.
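
A feature forest model is also a log-linear model, now defined over whole derivations. The features f_i are extracted from the nodes of a derivation, and the normalizer Z(s) sums over all derivations in the parse forest of s; the feature forest representation is what makes this sum tractable:

    p(d|s) = exp( Σ_i λ_i f_i(d, s) ) / Z(s)
    Z(s)   = Σ_{d'} exp( Σ_i λ_i f_i(d', s) )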

For the estimation of a feature forest model, we need a pair of a correct parse tree and a parse forest for each sentence in a training corpus. Given a derivbank, "forestmaker" makes a correct parse tree from a derivation, and parses a sentence to make a parse forest. To parse a sentence, we need to implement an interface for parsing. For details, see How to use a grammar.

To make a feature forest model, we implement an interface defined in "mayz/amismodel.lil": extract_terminal_event/6, extract_unary_event/7, extract_binary_event/8, extract_root_event/4, and feature_mask/3. The first four predicates extract an event corresponding to a terminal node, a unary schema application, a binary schema application, and a root node, respectively. The last one applies masks to the extracted events, as in making a unigram model.
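
As a rough illustration, a binary event might combine the name of the applied schema with the head words of the two daughters, capturing both rule preferences and bi-lexical dependencies. The argument layout of extract_binary_event/8 below is an assumption, as are the feature names; see "mayz/amismodel.lil" for the actual interface:

    %% Sketch only: the argument layout is assumed, not the real interface.
    extract_binary_event("forestmodel", $SchemaName, $Left, $Right, $Mother,
                         $LeftWord, $RightWord, $Event) :-
        $LeftWord  = (SURFACE\$LSurf & POS\$LPos),   % assumed word features
        $RightWord = (SURFACE\$RSurf & POS\$RPos),
        %% the event: schema name plus a bi-lexical dependency
        $Event = [$SchemaName, $LSurf, $LPos, $RSurf, $RPos].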

First, using forestmaker, events are extracted and an initial event file is created. The remaining process is the same as for a unigram model; amisfilter makes amis-style data files, and amis estimates the model parameters.

When making a feature forest model, a simpler model (e.g. a unigram model) can be used as a reference distribution. By implementing reference_prob_terminal/5, reference_prob_unary/6, reference_prob_binary/7, and reference_prob_root/3, we can specify a reference probability distribution for a terminal node, a unary schema application, a binary schema application, and a root node, respectively.
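
For example, a unigram model can serve as the reference distribution for terminal nodes. In the sketch below, the argument order of reference_prob_terminal/5 is an assumption, and unigram_prob/3 is a hypothetical lookup of p(l|w) in a previously estimated unigram model:

    %% Sketch: use a unigram model p(l|w) as the reference probability of
    %% selecting lexical entry $LexName for the word $Word.
    reference_prob_terminal("forestmodel", $Word, $LexName, $Sign, $Prob) :-
        unigram_prob($Word, $LexName, $Prob).   % hypothetical lookup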


MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)