GENIA Corpus with meta-knowledge annotation

The Meta-knowledge_GENIA_corpus directory contains a version of the entire GENIA event corpus, which has been enriched with meta-knowledge annotation. A more detailed description of this annotation, together with access to the annotation guidelines, is available here.

When downloading the corpus, please ensure that you adhere to the terms and conditions of the licences, which are contained within the LICENCES directory of the distribution. The licences can also be viewed here.

The Meta-knowledge_GENIA_corpus directory contains 3 subdirectories:

  • Corpus - Contains the annotated corpus in XML format. All files have been validated according to the DTD (see next bullet point).
  • ModifiedGENIAtypes - Contains modified versions of the DTD and CSS files provided with the original GENIA corpus. The modifications relate to the newly added meta-knowledge annotation. The CSS file, in particular, allows the annotated files to be displayed graphically in the X-Conc tool, which was used to perform the annotation.
  • GeniaOntologies - Contains the two GENIA ontologies encoded in OWL. The GENIAterm40.owl defines the term classes on which the GENIA term annotation is based. The GENIAevent.owl defines the event classes on which the GENIA event annotation is based.

The XML annotation of the corpus follows the GENIA event annotation format, with additions to allow meta-knowledge to be encoded.

Each file expresses two levels of annotation of the target text:

  • text-bound event arguments and other annotated biological concepts
  • event annotations. It is at this level that the original GENIA annotation format has been modified to allow meta-knowledge to be encoded.

An example of an annotated sentence within the XML file is shown below:

<sentence id="S9">Nuclear transcription studies in vitro showed that 
	<term id="T28" lex="LTB4" sem="Organic_compound_other">LTB4</term> 
	increased the transcription of the 
	<term id="T29" lex="c-fos_gene" sem="DNA_domain_or_region">
           c-fos gene
	</term>
	7-fold and the 
	<term id="T30" lex="c-jun_gene" sem="DNA_domain_or_region">
           c-jun gene
	</term>
	1.4-fold.
</sentence>
<event KT="Analysis" Manner="High" id="E30">
	<type class="Positive_regulation"/>
	<theme idref="E32"/>
	<cause idref="T28"/>
	<clue><clueExperiment>Nuclear transcription studies 
        in vitro</clueExperiment> <clueKT>showed</clueKT> that LTB4 
        <clueType>increased</clueType> the transcription of the 
	c-fos gene <clueManner>7-fold</clueManner> and the c-jun gene 
	1.4-fold.</clue>
</event>
<event KT="Analysis" Manner="Low" id="E31">
	<type class="Positive_regulation"/>
	<theme idref="E33"/>
	<cause idref="T28"/>
	<clue><clueExperiment>Nuclear transcription studies 
        in vitro</clueExperiment> <clueKT>showed</clueKT> that LTB4 
        <clueType>increased</clueType> the transcription of the 
	c-fos gene 7-fold and the c-jun gene 
	1.4-fold.</clue>
</event>
<event KT="Other" id="E32">
	<type class="Transcription"/>
	<theme idref="T29"/>
	<clue>Nuclear transcription studies in vitro showed that LTB4 
        increased the <clueType>transcription</clueType> 
        <linkTheme>of</linkTheme> the c-fos gene 7-fold and 
	the c-jun gene 1.4-fold.</clue>
</event>
<event KT="Other" id="E33">
	<type class="Transcription"/>
	<theme idref="T30"/>
	<clue>Nuclear transcription studies in vitro showed that 
        LTB4 increased the <clueType>transcription</clueType> 
        <linkTheme>of</linkTheme> the c-fos gene 7-fold 
	and the c-jun gene 1.4-fold.</clue>
</event>

Below, we provide a brief description of the above XML representation, in terms of the original GENIA annotation and the information added to represent meta-knowledge.

Original GENIA annotation

Each sentence of the abstract is contained within a <sentence> element. Biological concepts are annotated inline, indicated by <term> elements. Each <term> element has the following attributes:

  • sem - The biological concept type assigned to the span. Concept types belong to the GENIA Term Ontology.
  • id - A unique id for the span, beginning with "T"
  • lex - The value of the text span, with spaces replaced with underscores
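
As an illustration of the <term> attributes described above, the following minimal Python sketch collects the terms of one sentence using the standard library. The sample text is modelled on the example sentence; the <doc> wrapper element and the function name collect_terms are assumptions made for this sketch only and are not part of the corpus format.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the example sentence. The <doc> wrapper is a
# hypothetical root, added only so that the fragment parses on its own.
SAMPLE = """<doc>
<sentence id="S9">Nuclear transcription studies in vitro showed that
<term id="T28" lex="LTB4" sem="Organic_compound_other">LTB4</term>
increased the transcription of the
<term id="T29" lex="c-fos_gene" sem="DNA_domain_or_region">c-fos gene</term>
7-fold and the
<term id="T30" lex="c-jun_gene" sem="DNA_domain_or_region">c-jun gene</term>
1.4-fold.</sentence>
</doc>"""


def collect_terms(root):
    """Map each term id to a (sem, lex, covered text) triple."""
    terms = {}
    for term in root.iter("term"):
        # itertext() also gathers the text of any nested elements;
        # split/join collapses line breaks and indentation.
        text = " ".join("".join(term.itertext()).split())
        terms[term.get("id")] = (term.get("sem"), term.get("lex"), text)
    return terms


terms = collect_terms(ET.fromstring(SAMPLE))
for term_id, (sem, lex, text) in terms.items():
    print(term_id, sem, lex, repr(text))
```

In this sample, replacing the spaces in the covered text with underscores gives the lex value, as described in the list above.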

Following the <sentence> element, the events in the sentence are listed, each within an <event> element. Each event has a unique id, starting with an "E". Within the <event> element, there are the following elements:

  • type - The type assigned to the event, given by the class attribute. Event types belong to the GENIA Event Ontology.
  • semantic arguments - There is one element for each annotated argument of the event, named according to the semantic role assigned to the argument, e.g. theme, cause, etc. Each element has one or more attributes, whose values correspond to the id(s) of the argument(s) filling the role. The attributes are named idref, idref1, idref2, etc. The value of each attribute begins either with a "T", indicating that the argument span corresponds to one of the <term> elements, or with an "E", indicating that the argument is an embedded event whose structure is described in another <event> element.
  • clue - This element contains the complete text of the sentence containing the event. The <clueType> element surrounds the verb or nominalised verb on which the event is centred. Several other elements may be present, which constitute lexical clues used to help identify different types of information about the event. Some of these relate to meta-knowledge annotation (see below), while others were added as part of the original GENIA event annotation. The latter include <clueExperiment>, which marks text spans that describe experimental methods, and <linkTheme>, which marks text spans that link the <clueType> element to the theme of the event. The complete list of original GENIA elements that can occur within the <clue> element can be found in the GENIA event annotation guidelines.
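
The event structure described above can be read with a short Python sketch such as the one below. It is only an illustration under stated assumptions: the <doc> wrapper element and the helper names read_event and kind are hypothetical, and role elements are recognised simply by the presence of attributes whose names start with "idref".

```python
import xml.etree.ElementTree as ET

# Two events modelled on the example; <doc> is a hypothetical wrapper.
SAMPLE = """<doc>
<event KT="Analysis" Manner="High" id="E30">
  <type class="Positive_regulation"/>
  <theme idref="E32"/>
  <cause idref="T28"/>
  <clue><clueExperiment>Nuclear transcription studies
  in vitro</clueExperiment> <clueKT>showed</clueKT> that LTB4
  <clueType>increased</clueType> the transcription of the
  c-fos gene <clueManner>7-fold</clueManner> and the c-jun gene
  1.4-fold.</clue>
</event>
<event KT="Other" id="E32">
  <type class="Transcription"/>
  <theme idref="T29"/>
  <clue>Nuclear transcription studies in vitro showed that LTB4
  increased the <clueType>transcription</clueType>
  <linkTheme>of</linkTheme> the c-fos gene 7-fold and
  the c-jun gene 1.4-fold.</clue>
</event>
</doc>"""


def read_event(event):
    """Return the event type, trigger text and role -> argument ids."""
    arguments = {}
    for child in event:
        # Role elements (theme, cause, ...) carry idref, idref1, idref2, ...
        ids = [value for name, value in sorted(child.attrib.items())
               if name.startswith("idref")]
        if ids:
            arguments.setdefault(child.tag, []).extend(ids)
    clue_type = event.find("clue/clueType")
    return {
        "type": event.find("type").get("class"),
        "trigger": clue_type.text if clue_type is not None else None,
        "arguments": arguments,
    }


def kind(argument_id):
    """'T' ids point to <term> elements, 'E' ids to other <event> elements."""
    return "term" if argument_id.startswith("T") else "event"


root = ET.fromstring(SAMPLE)
events = {event.get("id"): read_event(event) for event in root.iter("event")}
for event_id, info in events.items():
    for role, ids in info["arguments"].items():
        for argument_id in ids:
            print(event_id, info["type"], role, argument_id, kind(argument_id))
```

An argument id that refers to an event (here the theme of E30) can be followed recursively through the events dictionary to reach the text-bound terms.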

Meta-knowledge annotation

Meta-knowledge annotation is encoded in two places in the XML:

  • As attributes of the <event> element, to encode the values assigned to the 5 meta-knowledge dimensions. These are encoded using the attributes KT (Knowledge Type), CL (Certainty Level), Polarity, Manner and Source. Note that these attributes only appear if their annotated values differ from the default values (as defined in the DTD).
  • As elements within the <clue> element, to denote clue phrases (if any) used to determine the values of the different meta-knowledge dimensions. The elements that may appear are: <clueKT>, <clueCL>, <cluePolarity>, <clueManner> and <clueSource>.
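
A minimal Python sketch of reading both parts of the meta-knowledge annotation is given below. The names DIMENSIONS and meta_knowledge are hypothetical. The default values are deliberately not hard-coded: they are defined in the DTD, so the sketch accepts them as an optional caller-supplied mapping and otherwise reports absent attributes as None. The standard-library ElementTree parser used here does not load an external DTD, so it will not fill in defaults by itself.

```python
import xml.etree.ElementTree as ET

# The five meta-knowledge dimensions, as named in the <event> attributes.
DIMENSIONS = ("KT", "CL", "Polarity", "Manner", "Source")

# Event modelled on the example (E30).
SAMPLE = """<event KT="Analysis" Manner="High" id="E30">
  <type class="Positive_regulation"/>
  <theme idref="E32"/>
  <cause idref="T28"/>
  <clue><clueExperiment>Nuclear transcription studies
  in vitro</clueExperiment> <clueKT>showed</clueKT> that LTB4
  <clueType>increased</clueType> the transcription of the
  c-fos gene <clueManner>7-fold</clueManner> and the c-jun gene
  1.4-fold.</clue>
</event>"""


def meta_knowledge(event, defaults=None):
    """Return (values, clues) for the five meta-knowledge dimensions.

    `defaults` is an optional mapping taken from the DTD by the caller;
    dimensions without an attribute and without a default map to None.
    """
    defaults = defaults or {}
    values = {dim: event.get(dim, defaults.get(dim)) for dim in DIMENSIONS}

    clues = {}
    clue = event.find("clue")
    for dim in DIMENSIONS:
        # Clue elements are named clueKT, clueCL, cluePolarity, ...
        spans = clue.iter("clue" + dim) if clue is not None else []
        clues[dim] = [" ".join("".join(span.itertext()).split())
                      for span in spans]
    return values, clues


values, clues = meta_knowledge(ET.fromstring(SAMPLE))
print(values)
print(clues)
```

Passing a defaults mapping read from the DTD in the ModifiedGENIAtypes directory would replace the None entries with the corresponding default values.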