GENIA Corpus with meta-knowledge annotation
The Meta-knowledge_GENIA_corpus directory contains a version of the entire GENIA event corpus, which has been enriched with meta-knowledge annotation. A more detailed description of this annotation, together with access to the annotation guidelines, is available here.
When downloading the corpus, please ensure that you adhere to the terms and conditions of the licences, which are contained within the LICENCES directory of the distribution. The licences can also be viewed here.
The Meta-knowledge_GENIA_corpus directory contains 3 subdirectories:
- Corpus - Contains the annotated corpus in XML format. All files have been validated according to the DTD (see next bullet point).
- ModifiedGENIAtypes - Contains modified versions of the DTD and CSS files provided with the original GENIA corpus. The modifications relate to the newly added meta-knowledge annotation. The CSS file, in particular, allows the annotated files to be displayed graphically in the X-Conc tool, which was used to perform the annotation.
- GeniaOntologies - Contains the two GENIA ontologies encoded in OWL. GENIAterm40.owl defines the term classes on which the GENIA term annotation is based. GENIAevent.owl defines the event classes on which the GENIA event annotation is based.
The XML annotation of the corpus follows the GENIA event annotation format, with additions to allow meta-knowledge to be encoded.
Two levels of annotation of the target text are expressed within each file, i.e.
- text-bound event arguments and other annotated biological concepts
- event annotations. It is at this level that modifications have been made to the original GENIA annotation format, to allow meta-knowledge to be encoded
An example of an annotated sentence within the XML file is shown below:
<sentence id="S9">Nuclear transcription studies in vitro showed that <term id="T28" lex="LTB4" sem="Organic_compound_other">LTB4</term> increased the transcription of the <term id="T29" lex="c-fos_gene" sem="DNA_domain_or_region"> c-fos gene </term> 7-fold and the <term id="T30" lex="c-jun_gene" sem="DNA_domain_or_region"> c-jun gene </term> 1.4-fold. </sentence>
<event KT="Analysis" Manner="High" id="E30">
  <type class="Positive_regulation"/>
  <theme idref="E32"/>
  <cause idref="T28"/>
  <clue><clueExperiment>Nuclear transcription studies in vitro</clueExperiment> <clueKT>showed</clueKT> that LTB4 <clueType>increased</clueType> the transcription of the c-fos gene <clueManner>7-fold</clueManner> and the c-jun gene 1.4-fold.</clue>
</event>
<event KT="Analysis" Manner="Low" id="E31">
  <type class="Positive_regulation"/>
  <theme idref="E33"/>
  <cause idref="T28"/>
  <clue><clueExperiment>Nuclear transcription studies in vitro</clueExperiment> <clueKT>showed</clueKT> that LTB4 <clueType>increased</clueType> the transcription of the c-fos gene 7-fold and the c-jun gene <clueManner>1.4-fold</clueManner>.</clue>
</event>
<event KT="Other" id="E32">
  <type class="Transcription"/>
  <theme idref="T29"/>
  <clue>Nuclear transcription studies in vitro showed that LTB4 increased the <clueType>transcription</clueType> <linkTheme>of</linkTheme> the c-fos gene 7-fold and the c-jun gene 1.4-fold.</clue>
</event>
<event KT="Other" id="E33">
  <type class="Transcription"/>
  <theme idref="T30"/>
  <clue>Nuclear transcription studies in vitro showed that LTB4 increased the <clueType>transcription</clueType> <linkTheme>of</linkTheme> the c-fos gene 7-fold and the c-jun gene 1.4-fold.</clue>
</event>
Below, we provide a brief description of the above XML representation, in terms of the original GENIA annotation and the information added to represent meta-knowledge.
Original GENIA annotation
Each sentence of the abstract is contained within a <sentence> element. Biological concepts are annotated inline, indicated by <term> elements. Each <term> element has the following attributes:
- sem - The biological concept type assigned to the span. Concept types belong to the GENIA Term Ontology.
- id - A unique id for the span, beginning with "T"
- lex - The value of the text span, with spaces replaced with underscores
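As an illustration of the inline term annotation described above, the following minimal Python sketch lists the <term> elements of the example sentence together with their attributes. It uses only the standard library. The <fragment> wrapper element and the function name list_terms are assumptions introduced so that the snippet parses on its own; they are not part of the corpus format.

```python
import xml.etree.ElementTree as ET

# The <fragment> root is a hypothetical wrapper (not part of the corpus)
# so that the example sentence forms a well-formed standalone document.
SAMPLE = """<fragment>
<sentence id="S9">Nuclear transcription studies in vitro showed that <term id="T28" lex="LTB4" sem="Organic_compound_other">LTB4</term> increased the transcription of the <term id="T29" lex="c-fos_gene" sem="DNA_domain_or_region"> c-fos gene </term> 7-fold and the <term id="T30" lex="c-jun_gene" sem="DNA_domain_or_region"> c-jun gene </term> 1.4-fold. </sentence>
</fragment>"""


def list_terms(root):
    """Collect the id, sem and lex attributes and the text of every <term>."""
    terms = []
    for sentence in root.iter("sentence"):
        for term in sentence.iter("term"):
            terms.append({
                "sentence": sentence.get("id"),
                "id": term.get("id"),
                "sem": term.get("sem"),
                "lex": term.get("lex"),
                # itertext() also covers any markup nested inside the term span
                "text": "".join(term.itertext()).strip(),
            })
    return terms


terms = list_terms(ET.fromstring(SAMPLE))
for t in terms:
    print(t["id"], t["sem"], repr(t["text"]))
```

In this example, replacing the underscores in lex with spaces gives back the surface text of each span.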
Following the <sentence> element, the events in the sentence are listed, each within an <event> element. Each event has a unique id, starting with an "E". Within the <event> element, there are the following elements:
- type - The type assigned to the event is indicated by the class attribute. Event types belong to the GENIA Event Ontology.
- semantic arguments - There is an element corresponding to each annotated argument of the event, which is named according to the semantic role assigned to the argument, e.g. theme, cause, etc. Each element has one or more attributes, whose values correspond to the id(s) of the argument(s) that fill the role. The attributes are named idref, idref1, idref2, etc. The value of each attribute begins either with a "T", indicating that the argument span corresponds to one of the <term> elements, or with an "E", indicating that the argument is an embedded event whose structure is described in another <event> element.
- clue - This element contains the complete sentence containing the event. The <clueType> element surrounds the verb/nominalised verb on which the event is centred. Several other elements may be present, which constitute lexical clues that are used to help identify different types of information about the event. Some of these relate to meta-knowledge annotation (see below), while others were added as part of the original GENIA event annotation. These include <clueExperiment>, which marks text spans that describe experimental methods, and <linkTheme>, which marks text spans that provide a link between the <clueType> element and the theme of the event. The complete list of original GENIA elements that can occur within the <clue> element can be found in the GENIA event annotation guidelines.
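The event structure described above can be read along the following lines. This is a minimal sketch using Python's standard library, applied to two events from the example; the <fragment> wrapper and the function name read_event are assumptions made for illustration. Any child of <event> other than <type> and <clue> is treated as a semantic argument, and the first letter of each idref value decides whether it points to a term or to an embedded event.

```python
import xml.etree.ElementTree as ET

# Hypothetical <fragment> wrapper so that the events parse on their own.
SAMPLE = """<fragment>
<event KT="Analysis" Manner="High" id="E30">
  <type class="Positive_regulation"/>
  <theme idref="E32"/>
  <cause idref="T28"/>
  <clue><clueExperiment>Nuclear transcription studies in vitro</clueExperiment> <clueKT>showed</clueKT> that LTB4 <clueType>increased</clueType> the transcription of the c-fos gene <clueManner>7-fold</clueManner> and the c-jun gene 1.4-fold.</clue>
</event>
<event KT="Other" id="E32">
  <type class="Transcription"/>
  <theme idref="T29"/>
  <clue>Nuclear transcription studies in vitro showed that LTB4 increased the <clueType>transcription</clueType> <linkTheme>of</linkTheme> the c-fos gene 7-fold and the c-jun gene 1.4-fold.</clue>
</event>
</fragment>"""


def read_event(event):
    """Return the event type, its trigger text and its semantic arguments."""
    args = []
    for child in event:
        if child.tag in ("type", "clue"):
            continue  # every other child names a semantic role (theme, cause, ...)
        for name, value in child.attrib.items():
            if name.startswith("idref"):  # idref, idref1, idref2, ...
                kind = "event" if value.startswith("E") else "term"
                args.append((child.tag, value, kind))
    type_el = event.find("type")
    clue = event.find("clue")
    return {
        "type": type_el.get("class") if type_el is not None else None,
        "trigger": clue.findtext("clueType") if clue is not None else None,
        "args": args,
    }


root = ET.fromstring(SAMPLE)
events = {e.get("id"): read_event(e) for e in root.iter("event")}
for event_id, info in events.items():
    print(event_id, info["type"], info["trigger"], info["args"])
```

Keeping the events in a dictionary keyed by id makes it straightforward to follow an "E" reference, such as the theme of E30, to the embedded event it names.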
Meta-knowledge annotation
Meta-knowledge annotation is encoded in two places in the XML:
- As attributes of the <event> element, to encode the values assigned to the 5 meta-knowledge dimensions. These are encoded using the attributes KT (Knowledge Type), CL (Certainty Level), Polarity, Manner and Source. Note that these attributes only appear if their annotated values differ from the default values (as defined in the DTD).
- As elements within the <clue> element, to denote clue phrases (if any) used to determine the values of the different meta-knowledge dimensions. The elements that may appear are: <clueKT>, <clueCL>, <cluePolarity>, <clueManner> and <clueSource>.
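The two encodings above can be read together, as in the following minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the function name meta_knowledge and its defaults parameter are hypothetical, and no default values are hard-coded here because they are defined in the DTD. A dimension whose attribute is absent is reported as None unless the caller supplies a default taken from the DTD.

```python
import xml.etree.ElementTree as ET

# Attribute names of the five meta-knowledge dimensions, as listed above.
DIMENSIONS = ("KT", "CL", "Polarity", "Manner", "Source")

SAMPLE = """<event KT="Analysis" Manner="High" id="E30">
  <type class="Positive_regulation"/>
  <theme idref="E32"/>
  <cause idref="T28"/>
  <clue><clueExperiment>Nuclear transcription studies in vitro</clueExperiment> <clueKT>showed</clueKT> that LTB4 <clueType>increased</clueType> the transcription of the c-fos gene <clueManner>7-fold</clueManner> and the c-jun gene 1.4-fold.</clue>
</event>"""


def meta_knowledge(event, defaults=None):
    """Return value, explicitness and clue phrase for each dimension.

    `defaults` is a hypothetical mapping of dimension name to default value;
    callers should fill it from the DTD shipped with the corpus.
    """
    defaults = defaults or {}
    clue = event.find("clue")
    result = {}
    for dim in DIMENSIONS:
        # Clue elements are named clueKT, clueCL, cluePolarity, ...
        clue_el = clue.find("clue" + dim) if clue is not None else None
        result[dim] = {
            "value": event.get(dim, defaults.get(dim)),
            "explicit": dim in event.attrib,
            "clue": "".join(clue_el.itertext()) if clue_el is not None else None,
        }
    return result


event = ET.fromstring(SAMPLE)
mk = meta_knowledge(event)
for dim, info in mk.items():
    print(dim, info)
```

Note that a validating XML parser would fill in DTD-declared defaults automatically; xml.etree.ElementTree does not validate, which is why absent attributes are handled explicitly in this sketch.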