Corpus-oriented grammar development [1] is a new methodology of grammar engineering targeting linguistically motivated lexicalized grammars, including HPSG. Traditionally, the development of lexicalized grammars has relied on considerable human effort and has been an extremely difficult task, because fine-grained constraints must be written into the lexical entries of a large number of words. In particular, grammar development often requires redesigning the grammar, and it is very difficult to keep a large number of complicated, rapidly changing constraints consistent.
In our approach, we first develop a treebank in the target grammar formalism. Treebank development means that we externalize our analysis of each sentence as an annotation of that sentence. It should be easier to write concrete analyses of example sentences than to write abstract linguistic constraints.
Treebank development can be accelerated by various methods. First, we do not need to write down all the constraints assumed in the target grammar formalism. When we specify partial constraints, the remaining constraints are filled in by applying grammar rules and principles. For example, we do not need to write detailed constraints on the subcategorization frames of words; they are automatically induced from the corpus annotation. Second, we can employ various heuristic rules for the annotation. Even though heuristic rules are used for corpus annotation, the resulting treebank will conform to the theory of the target formalism: any violation of the theoretical formulation caused by misapplication of the rules will be detected as a failure in the application of grammar rules and principles.
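The two mechanisms described above (completing partial annotation through a grammar principle, and detecting a bad annotation as a failed application) can be illustrated with a minimal sketch. It assumes feature structures are plain nested dictionaries and uses a toy version of a head-feature principle; the names `unify`, `apply_head_feature_principle`, and `UnificationFailure` are hypothetical and are not part of the toolkit, which is implemented in LiLFeS.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Unify two toy feature structures (nested dicts or atomic strings)."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for feature, value in b.items():
            out[feature] = unify(out[feature], value) if feature in out else value
        return out
    if a == b:
        return a
    raise UnificationFailure(f"conflict: {a!r} vs {b!r}")


def apply_head_feature_principle(mother, head_daughter):
    """Toy principle: the mother and its head daughter share one HEAD value.

    Each side may be annotated only partially; unification fills in the rest.
    """
    shared = unify(mother.get("HEAD", {}), head_daughter.get("HEAD", {}))
    return {**mother, "HEAD": shared}, {**head_daughter, "HEAD": shared}


# Partial annotation: each node specifies only part of HEAD.
mother, daughter = apply_head_feature_principle(
    {"HEAD": {"POS": "verb"}},
    {"HEAD": {"TENSE": "past"}, "PHON": "saw"},
)

# A heuristic rule that mislabels the mother is caught as a failure.
try:
    apply_head_feature_principle({"HEAD": {"POS": "noun"}}, {"HEAD": {"POS": "verb"}})
    misannotation_detected = False
except UnificationFailure:
    misannotation_detected = True
```

In this sketch the annotator never writes the full HEAD value on either node, yet both nodes end up fully specified, and an inconsistent annotation cannot silently enter the treebank.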
Given the treebank, we can obtain lexical entries as the terminals of each tree. Since the treebank conforms to the grammar theory, the obtained grammar also conforms to the theory. This process is completely deterministic and automatic (i.e., it involves no statistical induction), so we can easily trace each obtained lexical entry back to the annotation that justifies it.
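As a rough illustration of why this step is deterministic and traceable, the sketch below collects lexical entries by reading the terminals of each tree and records which sentence each entry came from. The `Node` class, the string-valued `sign` field (a stand-in for a full feature structure), and the sentence identifiers are all assumptions made for the example, not the toolkit's actual data format.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    sign: str                       # simplified stand-in for a feature structure
    word: Optional[str] = None      # set only on terminal nodes
    children: List["Node"] = field(default_factory=list)


def extract_lexicon(treebank: Dict[str, Node]) -> Dict[str, Dict[str, List[str]]]:
    """Map each word to its lexical entries and the sentences that attest them."""
    lexicon = defaultdict(lambda: defaultdict(list))
    for sent_id, root in treebank.items():
        stack = [root]
        while stack:
            node = stack.pop()
            if node.word is not None:
                # A terminal: its sign is the lexical entry for this word.
                lexicon[node.word][node.sign].append(sent_id)
            stack.extend(node.children)
    return {word: dict(entries) for word, entries in lexicon.items()}


treebank = {
    "s1": Node("S", children=[
        Node("NP", word="she"),
        Node("VP", children=[
            Node("V[SUBJ<NP>,COMPS<NP>]", word="saw"),
            Node("NP", word="it"),
        ]),
    ]),
    "s2": Node("S", children=[
        Node("NP", word="she"),
        Node("V[SUBJ<NP>,COMPS<>]", word="slept"),
    ]),
}
lexicon = extract_lexicon(treebank)
```

Because the extraction only copies what the annotated trees already contain, every entry in `lexicon` points back to the sentences that license it.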
In short, we can employ various heuristic methods for accelerating treebank development, while the theoretical validity of the obtained grammar is assured by the constraints offered by grammar rules and principles of the grammar theory. Linguistic theories regulate the constraints in the treebank, while concrete corpus annotations greatly help the writing of grammatical constraints.
In addition, we obtain a treebank and a grammar at the same time. This means that we have the resources needed to train statistical disambiguation models. For linguists, our approach provides a basis for example-based investigation of grammar theories.
Our approach is essentially different from the automatic learning of a grammar from a treebank, because we exploit a grammar theory to constrain the shape of the treebank. This is crucial for ensuring the linguistic validity of the obtained grammar.
We have implemented a toolkit for corpus-oriented grammar development in LiLFeS. While the concept of our grammar engineering is applicable to any lexicalized grammar, we are currently developing a wide-coverage HPSG grammar of English. As a starting point for the grammar development, we used the Penn Treebank. By applying several pattern rules to the Penn Treebank, we developed an HPSG treebank and obtained a wide-coverage HPSG grammar. The treebank was also used for developing a disambiguation model for HPSG [2].
Our HPSG parser is available as Enju. To install the parser, see the Enju home page for details.
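The pattern rules themselves are not listed here. Purely as an illustration of what a pattern rule over Penn Treebank trees might look like, the sketch below marks the head child of a local tree using a small priority table. The table `HEAD_RULES` and the function `find_head_index` are hypothetical and much simpler than any rule set used in practice; only the category and part-of-speech labels are taken from the Penn Treebank tag set.

```python
# Hypothetical head-finding table: for each parent label, the child labels
# to try, in priority order. Not the rule set used by the toolkit.
HEAD_RULES = {
    "S": ["VP"],
    "VP": ["VBD", "VBZ", "VBP", "VB", "MD", "VP"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN", "TO"],
}


def find_head_index(parent, children):
    """Return the index of the head child, or None if no rule applies."""
    for candidate in HEAD_RULES.get(parent, []):
        for index, label in enumerate(children):
            if label == candidate:
                return index
    return None  # left unmarked, e.g. for manual annotation


head_of_s = find_head_index("S", ["NP", "VP"])
head_of_vp = find_head_index("VP", ["VBD", "NP"])
head_of_np = find_head_index("NP", ["DT", "NN"])
```

A rule of this kind supplies only partial information (which child is the head); the remaining constraints would then come from applying the grammar rules and principles to the annotated tree.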
A toolkit for grammar development (the MAYZ toolkit) is also available. Follow the instructions below for downloading and installation.
% tar xvzf mayz-x.y.z.tar.gz
% cd mayz-x.y.z
% ./configure
% make
% make install