BioNLP'09 Shared Task on Event Extraction

Tools integrated with U-compare

Tools for various types of text processing that may be of interest to the shared task participants are primarily made available through U-compare. Please see the U-compare page for the BioNLP'09 shared task for more information.

Other tools

In addition to the tools offered for use through U-compare, we provide participants with access to the syntactic analyses created by a number of parsers that are not yet integrated with the platform. The analyses are provided in two different primary formats: phrase structure (constituency) in the standard bracketed Penn Treebank format and dependency using the dependency output format of the Stanford parser. Additionally, the "native" output format of parsers not directly using these output formats is provided for reference.

We have selected the following parsers:

Dan Bikel's implementation of Collins' parsing model ("Bikel").
The Charniak-Johnson reranking parser using David McClosky's self-trained biomedical parsing model ("McClosky-Charniak").
GDep, a version of the LRDEP/KSDEP native dependency parser trained on the GENIA Treebank ("GDep").
The C&C CCG parser ("CCG").

File formats for data distribution

The following file formats are used in the data:

Penn Treebank (PTB) format

The output of the phrase structure parsers (Bikel and McClosky-Charniak) is provided in the standard PTB format. An example follows.

(S1 (S (NP (NNS Interferons))
     (VP (VBP inhibit)
      (NP (NP (NN activation)) (PP (IN of) (NP (NN STAT6))))
      (PP (PP (IN by)
           (NP (NP (NN interleukin) (CD 4))
            (PP (IN in) (NP (JJ human) (NNS monocytes)))))
       (PP (IN by)
        (S (VP (VBG inducing)
            (NP (NN SOCS-1) (NN gene) (NN expression)))))))
     (. .)))

(In the package, a sentence-per-line format is used instead of a "pretty-printed" format such as this to simplify processing.)
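As a rough illustration of how the sentence-per-line PTB files might be read, the following minimal sketch parses one bracketed line into nested tuples and lists its tagged tokens. The function names parse_ptb and leaves are hypothetical and not part of the distributed package; the sketch assumes well-formed bracketing and that literal parentheses in the text appear only as PTB escapes.

```python
def parse_ptb(line):
    """Parse one bracketed PTB tree into nested (label, children) tuples."""
    tokens = line.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # word under a POS tag
                pos += 1
        pos += 1                               # consume ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return (POS tag, word) pairs in sentence order."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
    return pairs


example = "(S1 (S (NP (NNS Interferons)) (VP (VBP inhibit) (NP (NN activation))) (. .)))"
tree = parse_ptb(example)
print(leaves(tree))
```

Applying parse_ptb to each line of a .pstree file would then give one tree per sentence, in the same order as the tokenized file.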

Stanford dependency (SD) format

Dependency analyses, including dependency representations of the output of the phrase structure parsers as converted with the Stanford tools, are provided in the format output by the Stanford parser. Example:

nsubj(inhibit-2, Interferons-1)
dobj(inhibit-2, activation-3)
prep_of(activation-3, STAT6-5)
prep_by(inhibit-2, interleukin-7)
num(interleukin-7, 4-8)
amod(monocytes-11, human-10)
prep_in(interleukin-7, monocytes-11)
dep(inhibit-2, by-12)
pcomp(by-12, inducing-13)
nn(expression-16, SOCS-1-14)
nn(expression-16, gene-15)
dobj(inducing-13, expression-16)

In this representation, sentences are separated by empty lines. As the format references tokens by index (starting from 1) and the Stanford tools do not always include references to all tokens (e.g. punctuation), files with the tokenization used are provided for reference.

Please note that while the output of the GDep parser is provided in this format, it does not use the Stanford dependency scheme, and no mapping between the two schemes has been attempted. For example, the GDep analyses contain SUB(inhibit-2, Interferons-1) rather than nsubj(inhibit-2, Interferons-1).
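A minimal sketch of reading this format is given below. It is not part of the distributed tools, and the names DEP_RE, parse_dep_line and parse_dep_text are hypothetical. Because tokens may themselves contain hyphens (e.g. SOCS-1-14), the index is taken to be the digits after the last hyphen; the sketch assumes indices are plain integers. Relation labels are kept as-is, so it applies equally to the GDep labels.

```python
import re

# relation(governor-index, dependent-index); tokens may contain hyphens,
# so the greedy groups leave only the final "-digits" as the index.
DEP_RE = re.compile(r"^(\S+?)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dep_line(line):
    """Return (relation, (governor, index), (dependent, index))."""
    match = DEP_RE.match(line.strip())
    if match is None:
        raise ValueError("not a dependency line: %r" % line)
    rel, gov, gov_i, dep, dep_i = match.groups()
    return rel, (gov, int(gov_i)), (dep, int(dep_i))


def parse_dep_text(text):
    """Split file contents into sentences (lists of parsed dependencies)."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(parse_dep_line(line))
        elif current:                      # empty line closes a sentence
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


print(parse_dep_line("nn(expression-16, SOCS-1-14)"))
```

Note that this sketch skips consecutive empty lines, so if a sentence has no dependencies at all, the resulting list may no longer align with the tokenized file; this should be checked against the data before relying on sentence positions.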

Tokenized text

The tokenized text used as input for parsing is provided for reference. It was generated automatically using the sentence segmentation and tokenization tools integrated into U-compare. A simple sentence-per-line format is used, with whitespace separating tokens.

Interferons inhibit activation of STAT6 by interleukin 4 in human monocytes by inducing SOCS-1 gene expression .
Interferons -LRB- IFNs -RRB- inhibit induction by IL-4 of multiple genes in human monocytes .
However , the mechanism by which IFNs mediate this inhibition has not been defined .

Please note that PTB escapes (e.g. "-LRB-" for "(") are used in the data and the text of the analyses will thus not directly match the source texts.
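To compare tokens against the source text, the escapes can be mapped back to the original characters. The sketch below is only an illustration: the source mentions only "-LRB-" explicitly, and the remaining entries are the usual PTB bracket escapes, included on the assumption that they may also occur in the data. The names PTB_ESCAPES and unescape_tokens are hypothetical.

```python
# Standard PTB bracket escapes; only -LRB- is confirmed by the text above,
# the others are assumed to follow the same convention.
PTB_ESCAPES = {
    "-LRB-": "(",
    "-RRB-": ")",
    "-LSB-": "[",
    "-RSB-": "]",
    "-LCB-": "{",
    "-RCB-": "}",
}


def unescape_tokens(line):
    """Split a tokenized sentence on whitespace and undo PTB escapes."""
    return [PTB_ESCAPES.get(token, token) for token in line.split()]


print(unescape_tokens("Interferons -LRB- IFNs -RRB- inhibit induction by IL-4"))
```

Since token indices are unchanged by this mapping, the indices used in the dependency files still refer to the same positions.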

Other "native" output formats

As the GDep and C&C CCG parsers do not directly use either the Penn Treebank or the Stanford dependency format, their original outputs are provided for reference in the CoNLL-X format and the CCG output format, respectively. Please refer to the documentation of these tools and resources for details of these formats.

Download

The analyses available for download below (revision 2, 17/8/2009) are:

Bikel: Penn Treebank (native) and Stanford dependency (converted with Stanford tools)
McClosky-Charniak: Penn Treebank (native) and Stanford dependency (converted with Stanford tools)
GDep: CoNLL-X output (native) and representation in the Stanford dependency format (custom conversion)
CCG: CCG dependencies and lexical categories (native) and Stanford dependency format (converted with a variant of the conversion of Rimell and Clark (2008) [preprint])

The contents are:

tokenized/PMID.tokenized: Tokenized files, tokenized text format.
Bikel/: Subdirectory for analyses of the Bikel parser.
Bikel/pstree/PMID.pstree: Phrase structure analyses, PTB format.
Bikel/dep/PMID.dep: Dependency analyses, SD format.
McClosky-Charniak/: Subdirectory for analyses of the McClosky-Charniak parser.
McClosky-Charniak/pstree/PMID.pstree: Phrase structure analyses, PTB format.
McClosky-Charniak/dep/PMID.dep: Dependency analyses, SD format.
GDep/: Subdirectory for analyses of the GDep parser.
GDep/dep/PMID.dep: Dependency analyses, SD format.
GDep/original/PMID.CoNNL: Dependency (and other) analyses, CoNLL-X format.
CCG/: Subdirectory for analyses of the CCG parser.
CCG/dep/PMID.dep: Dependency analyses, SD format.
CCG/original/PMID.ccg: CCG dependency analyses and lexical categories, custom format.

All analyses use the sentence splitting and tokenization in tokenized/, generated automatically with domain tools integrated into U-compare. Our thanks to Laura Rimell for providing the CCG analyses.
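Given the layout listed above, the files belonging to one document can be located from its PMID. The following is a small sketch under the assumption that the package has been unpacked into a single root directory; LAYOUT and analysis_paths are hypothetical names, and the file extensions are copied as they appear in the listing.

```python
from pathlib import Path

# Relative path patterns, taken from the contents listing above.
LAYOUT = {
    "tokenized": "tokenized/{pmid}.tokenized",
    "Bikel/pstree": "Bikel/pstree/{pmid}.pstree",
    "Bikel/dep": "Bikel/dep/{pmid}.dep",
    "McClosky-Charniak/pstree": "McClosky-Charniak/pstree/{pmid}.pstree",
    "McClosky-Charniak/dep": "McClosky-Charniak/dep/{pmid}.dep",
    "GDep/dep": "GDep/dep/{pmid}.dep",
    "GDep/original": "GDep/original/{pmid}.CoNNL",
    "CCG/dep": "CCG/dep/{pmid}.dep",
    "CCG/original": "CCG/original/{pmid}.ccg",
}


def analysis_paths(root, pmid):
    """Map each analysis type to the expected file path for one document."""
    root = Path(root)
    return {key: root / pattern.format(pmid=pmid)
            for key, pattern in LAYOUT.items()}


# "analyses" and "12345" are placeholder values for illustration only.
for key, path in analysis_paths("analyses", "12345").items():
    print(key, path)
```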

The analyses of the PTB parsers were converted into dependencies using version 1.6.1 of the Stanford parser, generating the "collapsed" Stanford dependency representation. In the first revision, default settings were used; in the second revision, the -CCprocessed option was added (see below).

The older analyses remain available for reference: revision 1 train, devel, test. Revision 2 differs from revision 1 in two respects. First, the conversion from phrase structure to the Stanford dependency representation (for McClosky-Charniak and Bikel) was performed with the -CCprocessed option, which propagates conjunct dependencies: for example, in "A binds B and C" this introduces an additional dobj(binds, C) dependency. Second, the version of the McClosky-Charniak parser that was applied generates some non-standard POS tags (AUX and AUXG) that were not recognized by the Stanford tools. In the second revision, these were mapped to standard PTB tags using a simple deterministic mapping prior to conversion into the Stanford dependency representation. Our thanks to Christopher Manning for suggesting these improvements. Please refer to the Stanford typed dependencies manual (de Marneffe and Manning 2008) for more information on the Stanford conversion.
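The effect of conjunct propagation on the "A binds B and C" example can be pictured with the toy sketch below. It is not the Stanford implementation and covers only the simplest case, in which a conjunct inherits the relation its first conjunct has to a shared head; the function name propagate_conjuncts and the triple representation are assumptions made for illustration.

```python
def propagate_conjuncts(deps):
    """Copy the incoming dependency of a first conjunct to its other conjuncts.

    deps is a list of (relation, governor, dependent) triples in which
    coordination is marked by relations starting with "conj" (e.g. conj_and).
    """
    result = list(deps)
    seen = set(deps)
    for rel, first, other in deps:
        if not rel.startswith("conj"):
            continue
        for rel2, gov2, dep2 in deps:
            # find the relation attaching the first conjunct to its head
            if dep2 == first and not rel2.startswith("conj"):
                new = (rel2, gov2, other)
                if new not in seen:
                    seen.add(new)
                    result.append(new)
    return result


collapsed = [
    ("nsubj", "binds-2", "A-1"),
    ("dobj", "binds-2", "B-3"),
    ("conj_and", "B-3", "C-5"),
]
for dep in propagate_conjuncts(collapsed):
    print(dep)
```

For the input shown, the only triple added is ("dobj", "binds-2", "C-5"), matching the additional dobj dependency described above.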

Example data analyses

(The old analyses for the small example data sample also remain available at bionlp09_shared_task_sample_analyses.tar.gz.)
