BioCause annotation

View the corpus online with the brat rapid annotation tool.

The BioCause_corpus directory contains a version of the entire ID corpus, which has been enriched with causality annotation. A more detailed description of this annotation, together with access to the annotation guidelines, is available here.

When downloading the corpus, please ensure that you adhere to the terms and conditions of the licences, which are contained within the LICENCES file of the distribution. The causality annotations within BioCause are the result of work carried out at the National Centre for Text Mining (NaCTeM), School of Computer Science, the University of Manchester, UK. These annotations are copyrighted and licensed by NaCTeM under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The BioCause_corpus directory contains files of two types.

  • .txt - The plain text of one document, as used for the annotation.
  • .ann - The annotations for that text, in stand-off format.

The original 19 articles are split into sections, and each section is represented as a separate document, giving 198 .txt files in total. Each .txt file has a corresponding .ann annotation file with the same basename.
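
As an illustration of the basename pairing described above, the following Python sketch walks a corpus directory and yields each .txt file together with its .ann file. The function name paired_documents is hypothetical, and the sketch assumes that all files sit directly inside the directory rather than in subdirectories.

```python
from pathlib import Path


def paired_documents(corpus_dir):
    """Yield (txt_path, ann_path) pairs that share the same basename.

    Assumption: the .txt and .ann files live side by side in corpus_dir.
    Text files without a matching .ann file are skipped.
    """
    corpus = Path(corpus_dir)
    for txt_path in sorted(corpus.glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if ann_path.exists():
            yield txt_path, ann_path
```

Calling list(paired_documents("BioCause_corpus")) on a complete copy of the distribution should, according to the description above, produce one pair per document section.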

The .ann files contain named entity, event and causality annotations formatted according to the BioNLP 2011 ST style. For terms (text-bound annotations, whose IDs begin with T), the ID occurs first and is separated from the rest of the line by a TAB character. The primary annotation is given as a SPACE-separated triple (type, start-offset, end-offset). The start-offset is the index of the first character of the annotated span in the corresponding .txt file, i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span, so the character at the end-offset position is not included in the span. For reference, the text spanned by the annotation is included at the end of the line, separated by a TAB character.
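
The term line layout described above can be read with a few string splits. The Python sketch below is a minimal illustration, not part of the distribution: the names TextBound, parse_textbound and covered_text are hypothetical, and it assumes that every term covers a single contiguous span. Because the end-offset is exclusive, the offsets map directly onto a Python slice.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str    # e.g. "T140"
    ann_type: str  # e.g. "Trigger"
    start: int     # index of the first character of the span
    end: int       # index of the first character after the span
    text: str      # reference copy of the spanned text


def parse_textbound(line):
    """Parse one term line: ID <TAB> type start end <TAB> text."""
    ann_id, triple, text = line.rstrip("\n").split("\t", 2)
    ann_type, start, end = triple.split(" ")
    return TextBound(ann_id, ann_type, int(start), int(end), text)


def covered_text(document_text, term):
    """Return the span from the .txt content; the end-offset is exclusive."""
    return document_text[term.start:term.end]


term = parse_textbound("T140\tTrigger 3973 3982\tTherefore")
# With an exclusive end-offset, end - start equals the span length.
assert term.end - term.start == len(term.text)
```

Comparing covered_text(document_text, term) with term.text is a simple consistency check between an .ann file and its .txt file.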

In the case of events, the event ID occurs first, separated by a TAB character. The event trigger is specified as TYPE:ID and identifies the event type and its trigger through the ID. By convention, the event type is specified both in the trigger annotation and the event annotation. The event trigger is separated from the event arguments by SPACE. The event arguments are a SPACE-separated set of ROLE:ID pairs, where ROLE is one of the event- and task-specific argument roles (e.g., Effect, Cause, Theme, Site) and the ID identifies the entity or event filling that role. Note that several events can share the same trigger and that while the event trigger should be specified first, the event arguments can appear in any order.
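
An event line can be split along the same delimiters. The following Python sketch is an illustration under the format described above; the function name parse_event is hypothetical. The arguments are kept as an ordered list of (role, ID) pairs rather than a dictionary, so that nothing is lost if a role were to occur more than once.

```python
def parse_event(line):
    """Parse one event line: ID <TAB> TYPE:TRIGGER_ID ROLE:ID ROLE:ID ...

    Returns (event_id, event_type, trigger_id, arguments), where arguments
    is a list of (role, referenced_id) pairs in the order they appear.
    """
    event_id, body = line.rstrip("\n").split("\t", 1)
    trigger, *raw_args = body.split()
    event_type, trigger_id = trigger.split(":", 1)
    arguments = [tuple(arg.split(":", 1)) for arg in raw_args]
    return event_id, event_type, trigger_id, arguments


event = parse_event("E48\tTrigger:T140 Evidence:T139 Effect:T141")
```

Since the arguments may appear in any order, code that consumes the result should look roles up by name rather than by position.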

An example of an annotated causal relation within the .ann file is shown below:

T139  Argument 3854 3963  Mlc is a global regulator of carbohydrate 
                          metabolism and controls several genes 
                          involved in sugar utilization
T140  Trigger 3973 3982   Therefore
T141  Argument 4008 4052  Mlc also affects the virulence of Salmonella
E48   Trigger:T140 Evidence:T139 Effect:T141
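
To show how the IDs in the example tie the lines together, the Python sketch below replaces each ID referenced by an event with the text of the corresponding term. It is an illustration only: the function name resolve_events is hypothetical, and the example lines are written here as they would presumably appear in an .ann file, that is, one annotation per line with TAB-separated fields, rather than wrapped and aligned for display.

```python
ANN_LINES = [
    "T139\tArgument 3854 3963\tMlc is a global regulator of carbohydrate "
    "metabolism and controls several genes involved in sugar utilization",
    "T140\tTrigger 3973 3982\tTherefore",
    "T141\tArgument 4008 4052\tMlc also affects the virulence of Salmonella",
    "E48\tTrigger:T140 Evidence:T139 Effect:T141",
]


def resolve_events(lines):
    """Map each event ID to a list of (label, text) pairs.

    The first pair comes from the TYPE:ID trigger field; the rest are the
    ROLE:ID arguments. IDs that do not refer to a term (for example, a
    reference to another event) are left unresolved as the ID itself.
    """
    terms, events = {}, []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith("T"):
            terms[fields[0]] = fields[2]
        elif fields[0].startswith("E"):
            events.append((fields[0], fields[1].split()))
    # Resolve after reading everything, so line order does not matter.
    resolved = {}
    for event_id, pairs in events:
        resolved[event_id] = []
        for pair in pairs:
            label, ref = pair.split(":", 1)
            resolved[event_id].append((label, terms.get(ref, ref)))
    return resolved


relations = resolve_events(ANN_LINES)
for label, text in relations["E48"]:
    print(f"{label}: {text}")
```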