Enju 2.4 Output Specifications

Yusuke Miyao (National Institute of Informatics, Japan)
Last updated: 18/Jun/2010


  1. Overview
  2. Phrase structures
  3. Predicate argument structures
  4. HPSG-specific features
  5. Appendix: Correspondences between Enju and PTB

1. Overview

Enju outputs phrase structures and predicate argument structures in an XML format. Phrase structures are tree structures that express how words are combined to form phrases and clauses. Predicate argument structures are a set of relations that describe semantic relations of words/phrases/clauses in a sentence.

For example, a phrase structure of the sentence "John has come." is described by the following tree structure:

(S (NP John) (VP has (VP come)))
where "S", "NP", and "VP" are syntactic categories of phrases. In phrase structures, internal nodes are syntactic categories, and terminal nodes are words.

A predicate argument structure of the same sentence is given as the following three relations.

< come, arg1, John >
< has, arg1, John >
< has, arg2, come >

Each tuple represents a labeled relation (a labeled dependency from a word to a phrase/clause). For example, the first tuple means that "arg1" (the first argument, i.e., the semantic subject) of "come" is "John". The first element of a tuple expresses a predicate word, the second element is a label of the relation, and the third element is an argument phrase/clause of the predicate. Because relations represented by predicate argument structures are relations between a predicate and its arguments in a predicate logic, they do not necessarily correspond to syntactic head/argument. For example, when a prepositional phrase modifies a noun, the preposition is represented as a predicate and the noun is denoted as the "arg1" of the preposition, while, in syntax, the noun is the head and the prepositional phrase is the non-head.

Enju represents both structures in XML with three tags as listed below.

Tag       Attributes                                     Description
sentence  id parse_status fom                            sentence
cons      id cat xcat schema head sem_head               syntactic constituents (phrases, clauses, etc.)
tok       id cat pos base tense aspect voice aux type    words, punctuation marks
          lexentry pred arg1 arg2 arg3 arg4 mod

In general, "cons" represents phrase structures. Unique identifiers are assigned to every "cons" tag, and they are denoted by the "id" attribute. Syntactic categories of phrases and clauses, such as "NP" and "S", are expressed by the "cat" attribute, and finer classifications of categories are represented by the "xcat" attribute. Syntactic/semantic head daughters of internal nodes are expressed by the "head" and "sem_head" attributes, respectively. These attributes denote the identifier of "cons" of the head daughter. Terminal nodes of phrase structures are "tok" tags, and, like "cons", identifiers and syntactic categories are represented by the "id" and "cat" attributes. Additionally, Penn Treebank-style POS and a base form of a word are described by the "pos" and "base" attributes.

Predicate argument structures are expressed by attributes of the "tok" tag: "pred" denotes the type of the predicate for a word. "arg1", ..., "arg4", and "mod" denote identifiers of "cons", and they represent argument or modification phrases/clauses of the predicate. For example, when the value of "pred" is "verb_arg1" and the value of "arg1" is "c1", the first argument (i.e., semantic subject) is a phrase or a clause which is assigned the identifier "c1".

The other tag, "sentence", is used to express additional information about parsing results. A string processed by Enju as a sentence is bracketed by "sentence", and is assigned an identifier (the "id" attribute). The "parse_status" attribute indicates whether parsing succeeded. Its value is "success" when parsing succeeded; otherwise it gives the reason for the failure, as listed below.

parse_status                     reason of failure
empty line                       The input line is an empty string
fragmental parse                 Although the parser could not produce an analysis that spans the whole sentence, it outputs fragmental parse results
no successful parse              The parser could not produce an analysis
POS tagging error                Tokenization or POS tagging failed or returned an ill-formed string
lexical entry assignment error   Assignment of lexical entries failed (e.g. caused by wrong POS tags)
sentence length limit exceeded   The number of words was larger than the limit (enlarge the limit of sentence length to parse those sentences)
edge number limit exceeded       The number of produced edges exceeded the limit (enlarge the limit of edge number to parse those sentences)
parser setup error               Set-up of internal data structures failed (e.g. failure of memory allocation)
XML encoding error               Something went wrong when encoding a parse result into the XML format (contact the developer when you find this error)
unknown error                    Error caused by an unknown reason (contact the developer when you find this error)
fatal error                      An unrecoverable error occurred (contact the developer when you find this error)

The "fom" attribute denotes a figure-of-merit, which is a score of goodness of the parsing result.
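
The following Python sketch shows one way a consumer of the output might read "parse_status" and "fom". It is only an illustration: the <doc> wrapper element, the sample strings, and the function name "summarize" are hypothetical, and the sketch assumes that "fom" may be absent when parsing fails.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: two "sentence" elements wrapped in a made-up <doc> root
# so that the sample is one well-formed XML document. Constituents are omitted.
SAMPLE = """<doc>
<sentence id="s0" parse_status="success" fom="7.28">John has come.</sentence>
<sentence id="s1" parse_status="sentence length limit exceeded">A very long input ...</sentence>
</doc>"""


def summarize(sentence):
    """Return (id, outcome, fom) for one "sentence" element."""
    status = sentence.get("parse_status")
    if status == "success":
        outcome = "full"
    elif status == "fragmental parse":
        outcome = "partial"  # fragments are output, but no spanning analysis
    else:
        outcome = "failed"
    # Assumption: "fom" may be absent on failure, so it is treated as optional.
    fom = sentence.get("fom")
    return sentence.get("id"), outcome, None if fom is None else float(fom)


summaries = [summarize(s) for s in ET.fromstring(SAMPLE).iter("sentence")]
for sid, outcome, fom in summaries:
    print(sid, outcome, fom)
```

Treating "fragmental parse" separately from the other failure reasons is a design choice of this sketch, made because fragmental results are still output in that case.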

Here is an example XML-style output of the parsing of sentence "John has come":

<sentence id="s0" parse_status="success" fom="7.28">
  <cons id="c0" cat="S" xcat="" head="c3" sem_head="c3">
    <cons id="c1" cat="NP" xcat="" head="c2" sem_head="c2">
      <cons id="c2" cat="NX" xcat="" head="t0" sem_head="t0">
        <tok id="t0" cat="N" pos="NNP" base="john" pred="noun_arg0">John</tok>
      </cons>
    </cons>
    <cons id="c3" cat="VP" xcat="" head="c4" sem_head="c5">
      <cons id="c4" cat="VX" xcat="" head="t1" sem_head="t1">
        <tok id="t1" cat="V" pos="VBZ" base="have" pred="aux_arg12" arg1="c1" arg2="c5">has</tok>
      </cons>
      <cons id="c5" cat="VP" xcat="" head="t2" sem_head="t2">
        <tok id="t2" cat="V" pos="VBN" base="come" pred="verb_arg1" arg1="c1">come</tok>
      </cons>
    </cons>
  </cons>
</sentence>

Linguistic meanings of syntactic categories and predicate-argument relations are explained in Sections 2 and 3.

Other linguistic features, such as applied schema, tense, and aspect, are output as attributes of the "cons" and "tok" tags, although they are omitted from the example above. These are explained in Section 4.
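
To relate the XML output to the bracketed tree shown at the beginning of this section, the following Python sketch converts nested "cons" and "tok" tags into bracket notation. It is a minimal illustration, assuming that each "tok" element encloses its surface word and that every tag is closed; the sample is an abridged, hypothetical version of the example above (only "id" and "cat" are kept), and the function name "to_brackets" is made up.

```python
import xml.etree.ElementTree as ET

# Abridged, hypothetical version of the "John has come" example.
ENJU_XML = """
<sentence id="s0" parse_status="success" fom="7.28">
  <cons id="c0" cat="S">
    <cons id="c1" cat="NP">
      <cons id="c2" cat="NX"><tok id="t0" cat="N">John</tok></cons>
    </cons>
    <cons id="c3" cat="VP">
      <cons id="c4" cat="VX"><tok id="t1" cat="V">has</tok></cons>
      <cons id="c5" cat="VP"><tok id="t2" cat="V">come</tok></cons>
    </cons>
  </cons>
</sentence>
"""


def to_brackets(node):
    """Render a "cons"/"tok" subtree as a bracketed string."""
    if node.tag == "tok":
        return (node.text or "").strip()
    daughters = " ".join(to_brackets(child) for child in node)
    return f"({node.get('cat')} {daughters})"


root = ET.fromstring(ENJU_XML.strip())
tree = to_brackets(root.find("cons"))
print(tree)  # (S (NP (NX John)) (VP (VX has) (VP come)))
```

Unlike the simplified tree in the overview, the result keeps the unsaturated "NX" and "VX" nodes, because they are present as "cons" tags in the output.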

2. Phrase structures

Phrase structures are represented by two tags, "cons" and "tok": "cons" expresses internal nodes, and "tok" denotes terminal nodes of tree structures. In general, "cons" corresponds to phrases and clauses, which are called "constituents", while "tok" corresponds to words and punctuation marks, which are called "tokens". All "cons" and "tok" tags are assigned unique identifiers, which are used to identify the targets of various linguistic relations. Attributes of these tags represent syntactic information of annotated constituents.

Attributes of "cons"
cat        syntactic category
xcat       extra features of syntactic category
head       ID of the syntactic head daughter
sem_head   ID of the semantic head daughter

Attributes of "tok"
cat        syntactic category
pos        Penn Treebank-style part-of-speech tag
base       base form

A tree of syntactic constituents is expressed by nested "cons" tags. In the output of Enju, all trees have only binary or unary branchings; that is, each "cons" tag immediately dominates at most two "cons" tags.
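
The branching constraint can be checked mechanically. The Python sketch below is an illustration only: both fragments are hypothetical and heavily abridged, the second one is deliberately ill-formed, and the function name "over_branching" is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abridged fragments (most attributes omitted).
BINARY = (
    '<cons id="c0" cat="VP">'
    '<cons id="c1" cat="VX"><tok id="t0">has</tok></cons>'
    '<cons id="c2" cat="VP"><tok id="t1">come</tok></cons>'
    "</cons>"
)
# Deliberately violates the constraint: three daughters under one node.
TERNARY = (
    '<cons id="c0" cat="NP">'
    '<cons id="c1" cat="DP"><tok id="t0">a</tok></cons>'
    '<cons id="c2" cat="ADJP"><tok id="t1">big</tok></cons>'
    '<cons id="c3" cat="NX"><tok id="t2">dog</tok></cons>'
    "</cons>"
)


def over_branching(root):
    """Return the ids of "cons" nodes that have more than two daughters."""
    return [c.get("id") for c in root.iter("cons") if len(c) > 2]


print(over_branching(ET.fromstring(BINARY)))
print(over_branching(ET.fromstring(TERNARY)))
```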

Syntactic categories of constituents are expressed by the "cat" attribute. A value of "cat" is a concatenation of a POS (e.g., ADJ) and a suffix which indicates whether a constituent is a saturated phrase (expressed by "P") or an unsaturated constituent ("X"). A list of POSs is given below.

CONJ   Coordination conjunction
SC     Subordination conjunction

Additional symbols express other types of constituents. They are used as values of "cat" without suffixes.

COOD   Part of coordination
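
As an illustration of the naming scheme, the following Python sketch splits a "cat" value of a "cons" tag into its POS part and its suffix. It is a heuristic under stated assumptions, not a definitive rule: it treats a trailing "P" or "X" as the suffix whenever the value is longer than one character, and leaves other values (such as "COOD", or "S" as it appears in the overview example) whole. The function name "split_cat" is made up.

```python
def split_cat(cat):
    """Split a "cons" category into (POS, suffix); suffix is "" when absent."""
    # Heuristic: a final "P" (saturated) or "X" (unsaturated) is the suffix.
    if len(cat) > 1 and cat[-1] in ("P", "X"):
        return cat[:-1], cat[-1]
    return cat, ""


for value in ("NP", "VX", "CONJP", "SCP", "COOD", "S"):
    print(value, split_cat(value))
```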

Syntactic categories of tokens are also specified as values of "cat" of "tok" tags. The values are the same as the above, while, in this case, suffixes are not expressed. Additional attributes, "pos" and "base", represent morphological information: a Penn Treebank-style POS tag and the base form of the input string.

For example, "cat" of "tok" of the word "John" is "N", because this word is a noun. The value of "pos" is "NNP", which means a singular proper noun. The word also constitutes a nominal constituent without a determiner, and therefore is enclosed in a "cons" tag with "NX" as the value of "cat". Furthermore, since this constituent can take an empty determiner to become a noun phrase, it is also assigned another "cons" tag with "NP" as the value of "cat".

Since symbols of "cat" are sometimes very coarse, "xcat" expresses important linguistic distinctions. The value of "xcat" is a space-separated set of the following values.

COOD    Coordinated phrase/clause
IMP     Imperative sentence
INV     Subject-verb inversion
Q       Interrogative sentence with subject-verb inversion
REL     A relativizer included
FREL    A free relative included
TRACE   A trace included
WH      A wh-question word included
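
Because "xcat" is a space-separated set, it can be read into a set of feature names. The Python sketch below illustrates this; the function name "parse_xcat" is made up, and rejecting values outside the list above is a choice of the sketch rather than a documented behavior.

```python
# The values listed above.
XCAT_VALUES = frozenset({"COOD", "IMP", "INV", "Q", "REL", "FREL", "TRACE", "WH"})


def parse_xcat(value):
    """Return the set of features in an "xcat" value; reject unlisted ones."""
    features = set(value.split())
    unknown = features - XCAT_VALUES
    if unknown:
        raise ValueError(f"unlisted xcat value(s): {sorted(unknown)}")
    return features


print(parse_xcat(""))           # empty "xcat", as in the overview example
print(parse_xcat("REL TRACE"))  # a hypothetical combination of two features
```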

The value of "head" or "sem_head" is the identifier of one of the daughters of the constituent: the syntactic head daughter or the semantic head daughter, respectively.

The syntactic head of a constituent is denoted by "head", and is a daughter constituent that determines syntactic characteristics of the constituent. Usually, the syntactic head of "X phrase" (X=verb, noun, adjective, etc.) is X. The syntactic head of a sentence is a main verb phrase.

The semantic head of a phrase is denoted by "sem_head", and is a daughter constituent that mainly conveys a semantic content of the constituent. That is, function words are not semantic heads even when they are syntactic heads, while content words are syntactic and semantic heads. Actually, in the current implementation of Enju, "head" and "sem_head" are different in the following cases.

In other cases, "head" is identical to "sem_head".
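
The difference between the two attributes can be seen by following them from a constituent down to a token. The Python sketch below does this for an abridged, hypothetical version of the "John has come" example from the overview (only "id", "cat", "head", and "sem_head" are kept; each "tok" is assumed to enclose its surface word). The function name "lexical_head" is made up.

```python
import xml.etree.ElementTree as ET

# Abridged, hypothetical version of the overview example.
ENJU_XML = """
<sentence id="s0" parse_status="success">
  <cons id="c0" cat="S" head="c3" sem_head="c3">
    <cons id="c1" cat="NP" head="c2" sem_head="c2">
      <cons id="c2" cat="NX" head="t0" sem_head="t0">
        <tok id="t0" cat="N">John</tok>
      </cons>
    </cons>
    <cons id="c3" cat="VP" head="c4" sem_head="c5">
      <cons id="c4" cat="VX" head="t1" sem_head="t1">
        <tok id="t1" cat="V">has</tok>
      </cons>
      <cons id="c5" cat="VP" head="t2" sem_head="t2">
        <tok id="t2" cat="V">come</tok>
      </cons>
    </cons>
  </cons>
</sentence>
"""


def lexical_head(node_id, by_id, attr="head"):
    """Follow `attr` ("head" or "sem_head") from a node down to a "tok"."""
    node = by_id[node_id]
    while node.tag == "cons":
        node = by_id[node.get(attr)]
    return node


root = ET.fromstring(ENJU_XML.strip())
by_id = {n.get("id"): n for n in root.iter() if n.get("id") is not None}
syntactic = lexical_head("c0", by_id, "head").text
semantic = lexical_head("c0", by_id, "sem_head").text
print(syntactic, semantic)  # has come
```

In this sample the two paths diverge at "c3", whose "head" is the auxiliary "has" and whose "sem_head" is the verb phrase "come".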

3. Predicate argument structures

Predicate argument structures are expressed by attributes of "tok": "pred", "arg1", ..., "arg4", and "mod". "pred" denotes the type of a predicate. The others denote identifiers of constituents. For example, when the value of "pred" is "verb_arg1" and the value of "arg1" is "c1", the first argument (i.e., semantic subject) is a phrase or a clause identified by "c1".

A value of "pred" is one of the following values. It is a concatenation of a POS (e.g., noun, verb) and a symbol (or symbols) which indicates required arguments.

pred (predicate type):
  noun_arg0, noun_arg1, noun_arg2, noun_arg12,
  it_arg1, there_arg0,
  quote_arg2, quote_arg12, quote_arg23, quote_arg123,
  poss_arg2, poss_arg12,
  aux_arg12, aux_mod_arg12,
  verb_arg1, verb_arg12, verb_arg123, verb_arg1234,
  verb_mod_arg1, verb_mod_arg12, verb_mod_arg123, verb_mod_arg1234,
  adj_arg1, adj_arg12, adj_mod_arg1, adj_mod_arg12,
  conj_arg1, conj_arg12, conj_arg123, coord_arg12,
  det_arg1,
  prep_arg12, prep_arg123, prep_mod_arg12, prep_mod_arg123,
  lgs_arg2, dtv_arg2, punct_arg1, app_arg12,
  lparen_arg123, rparen_arg0,
  comp_arg1, comp_arg12, comp_mod_arg1,
  relative_arg1, relative_arg12
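
The naming pattern of "pred" values can be decomposed mechanically. The following Python sketch is an illustration under two assumptions that go beyond the text: that the digits after "arg" name the argument attributes a predicate takes (with "arg0" meaning none), and that "_mod" in the name signals that the "mod" attribute is used. The function name "parse_pred" is made up.

```python
import re

# <POS>[_mod]_arg<digits>, e.g. "verb_arg12" or "prep_mod_arg123".
PRED_PATTERN = re.compile(r"([a-z]+)(_mod)?_arg(\d+)")


def parse_pred(pred):
    """Split a "pred" value into (POS, has_mod, argument labels)."""
    match = PRED_PATTERN.fullmatch(pred)
    if match is None:
        raise ValueError(f"unexpected pred value: {pred!r}")
    pos, mod, digits = match.groups()
    # Assumption: "arg0" means the predicate takes no arguments.
    labels = [f"arg{d}" for d in digits if d != "0"]
    return pos, mod is not None, labels


for value in ("noun_arg0", "verb_arg12", "prep_mod_arg123", "quote_arg23"):
    print(value, parse_pred(value))
```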

Argument numbers ("X" in "argX") are assigned in the order of surface realizations in declarative sentences. For nouns, verbs, adjectives, adverbs, and prepositions, "arg1" is assigned to a left argument, and "arg2", ..., "arg4" are assigned to right arguments in left-to-right order. "mod" is assigned to a modifiee of VP modifiers (e.g. a matrix clause of a participial construction). For complementizers and determiners, their dependent phrases/clauses are "arg1". The complement of "'s" (e.g. "John" in "John 's") is expressed as "arg2". For punctuation marks and particles, their dependent phrases are denoted by "arg1". For subordination/coordination conjunctions, main/left conjuncts are represented by "arg1", and the other conjuncts are expressed as "arg2".

For example, "A beautiful butterfly is coming into my room." has the following predicate argument relations.

< coming, arg1, A beautiful butterfly >
< is, arg1, A beautiful butterfly >
< is, arg2, coming into my room >
< beautiful, arg1, butterfly >
< a, arg1, beautiful butterfly >
< into, arg1, coming >
< into, arg2, my room >
< my, arg1, room >

In the XML format, each relation is expressed by an attribute of "tok". For example, when we suppose the identifier of the phrase "A beautiful butterfly" is "c1", the XML annotation for "coming" will be like this:

<tok id="t1" cat="V" pos="VBG" base="come" pred="verb_arg1" arg1="c1" >
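
The tuples can be recovered from the XML by resolving each "argX" or "mod" value to the constituent it identifies. The Python sketch below illustrates this on an abridged, hypothetical version of the "John has come" example from the overview; it assumes that each "tok" encloses its surface word, and the names "surface" and "relations" are made up.

```python
import xml.etree.ElementTree as ET

# Abridged, hypothetical version of the overview example.
ENJU_XML = """
<sentence id="s0" parse_status="success" fom="7.28">
  <cons id="c0" cat="S">
    <cons id="c1" cat="NP">
      <cons id="c2" cat="NX">
        <tok id="t0" cat="N" pred="noun_arg0">John</tok>
      </cons>
    </cons>
    <cons id="c3" cat="VP">
      <cons id="c4" cat="VX">
        <tok id="t1" cat="V" pred="aux_arg12" arg1="c1" arg2="c5">has</tok>
      </cons>
      <cons id="c5" cat="VP">
        <tok id="t2" cat="V" pred="verb_arg1" arg1="c1">come</tok>
      </cons>
    </cons>
  </cons>
</sentence>
"""

LABELS = ("arg1", "arg2", "arg3", "arg4", "mod")


def surface(node):
    """Concatenate the words of all "tok" tags at or below `node`."""
    return " ".join((t.text or "").strip() for t in node.iter("tok"))


def relations(sentence):
    """Return < predicate, label, argument > tuples as surface strings."""
    by_id = {n.get("id"): n for n in sentence.iter() if n.get("id") is not None}
    triples = []
    for tok in sentence.iter("tok"):
        for label in LABELS:
            target = tok.get(label)
            if target in by_id:  # skip absent or unresolvable values
                triples.append((surface(tok), label, surface(by_id[target])))
    return triples


triples = relations(ET.fromstring(ENJU_XML.strip()))
for predicate, label, argument in triples:
    print(f"< {predicate}, {label}, {argument} >")
```

For this sample the sketch yields the same three relations as listed in the overview, in document order of the predicate tokens.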

4. HPSG-specific features

Because the grammar of Enju has rich linguistic information, a part of it can be output additionally. The following attributes are added to "cons" or "tok" tags.

Attributes for "cons"
schema (applied schema):
  subj_head, head_comp, spec_head, head_mod, mod_head, filler_head,
  head_relative, coord_left, coord_right, empty_filler_head,
  empty_spec_head, free_relative

Attributes for "tok"
tense      untensed, past, present                                            tense of a verb
aspect     none, perfect, progressive, perfect-progressive                    aspect of a verb
voice      active, passive                                                    voice of a verb
aux        minus, modal, have, be, do, copular                                auxiliary verb or not
type       pred, noun_mod, verb_mod, adj_mod, prep_mod, other_mod, pred_mod   syntactic type
lexentry   (see below)                                                        assigned lexical entry

All "cons" tags except preterminals (i.e., "cons" tags immediately above "tok") have non-empty values for "schema". All verbs have "aux" attributes, while principal verbs (i.e., aux="minus" or aux="copular") have "tense", "aspect", and "voice", whose values are non-empty strings. The "type" attribute expresses the syntactic type of a word: "pred" means predicative, "noun_mod", "verb_mod", "adj_mod", "prep_mod", and "other_mod" mean modifiers to nouns/verbs/adjectives/prepositions/other words, respectively, and "pred_mod" means a predicative modifier. If no "type" is assigned, it indicates that the word is an argument or the head of the sentence.

All "tok" tags have "lexentry". A value of "lexentry" is a lexeme name and the applied lexical rules, concatenated by hyphens. For example, when the value of "lexentry" is "[NP.nom<V.bse>NP.acc]_lxm-singular3rd_verb_rule", the lexical entry is obtained from the lexeme "[NP.nom<V.bse>NP.acc]_lxm" by applying the rule "singular3rd_verb_rule".
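
A "lexentry" value can be split back into its parts as in the following Python sketch. It is an illustration only and assumes that neither lexeme names nor rule names contain a hyphen themselves, which the text above does not state; the function name "split_lexentry" is made up.

```python
def split_lexentry(value):
    """Split a "lexentry" value into (lexeme name, list of lexical rules)."""
    # Assumption: hyphens occur only as separators between the parts.
    lexeme, *rules = value.split("-")
    return lexeme, rules


lexeme, rules = split_lexentry("[NP.nom<V.bse>NP.acc]_lxm-singular3rd_verb_rule")
print(lexeme)  # [NP.nom<V.bse>NP.acc]_lxm
print(rules)   # ['singular3rd_verb_rule']
```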

5. Appendix: Correspondences between Enju and PTB

Here is a rough sketch of correspondences of syntactic categories of Enju and Penn Treebank (PTB). It should be noted that this table does not necessarily mean that outputs of Enju can be formally translated into PTB-style outputs. Because the output of Enju is based on HPSG and it is different from the annotation policy of PTB, tree structures and/or syntactic categories are often different from those given by the PTB-style annotation. However, these mappings provide a clear image of what Enju expresses.

Enju cat   Enju xcat   PTB
                       ADJP, QP (number expression)
ADJP       REL         WHADJP (relativizer)
ADJP       FREL        WHADJP (free relative)
ADJP       WH          WHADJP (wh-phrase)
ADVP       REL         WHADVP (relativizer)
ADVP       FREL        WHADVP (free relative)
ADVP       WH          WHADVP (wh-phrase)
                       SBAR (complementizer phrase)
                       NP (possessive), QP (quantifier)
NP         REL         WHNP (relativizer)
NP         FREL        WHNP (free relative)
NP         WH          WHNP (wh-phrase)
PP         REL         WHPP (relativizer)
PP         WH          WHPP (wh-phrase)
S          REL         SBAR (relative clause)
S          FREL        SBAR (free relative clause)
                       SBAR (subordinate clause)