NaCTeM

BioCause

BioCause annotation

View the corpus online with the brat rapid annotation tool.

The BioCause_corpus directory contains a version of the entire ID corpus, which has been enriched with causality annotation. A more detailed description of this annotation, together with access to the annotation guidelines, is available here.

When downloading the corpus, please ensure that you adhere to the terms and conditions of the licences, which are contained within the LICENCES file of the distribution. The causality annotations within BioCause are the result of work carried out at the National Centre for Text Mining (NaCTeM), School of Computer Science, the University of Manchester, UK. These annotations are copyrighted and licensed by NaCTeM under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The BioCause_corpus directory contains files of two types.

  • .txt - Contains the text files used for the annotation.
  • .ann - Contains the annotated text in stand-off format.

The original 19 articles are split into sections, and each section is stored as a separate document, giving 198 .txt files in total. Each .txt file has a corresponding .ann annotation file with the same base name.
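
Because each .txt file and its .ann file share a base name, the two can be paired by swapping the extension. The following is a minimal sketch in Python; the function name `paired_documents` is hypothetical, and the directory path is whatever location the distribution was unpacked to.

```python
from pathlib import Path


def paired_documents(corpus_dir):
    """Yield (txt_path, ann_path) pairs that share a base name.

    Assumption: both file types sit directly inside corpus_dir.
    A .txt file without a matching .ann file is skipped.
    """
    corpus = Path(corpus_dir)
    for txt_path in sorted(corpus.glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if ann_path.exists():
            yield txt_path, ann_path


if __name__ == "__main__":
    # "BioCause_corpus" is the directory name given in the distribution.
    for txt_path, ann_path in paired_documents("BioCause_corpus"):
        print(txt_path.name, ann_path.name)
```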

The .ann files contain named entity, event and causality annotations formatted according to the BioNLP 2011 ST style. In the case of terms, the ID occurs first and is delimited from the rest of the line with a TAB character. The primary annotation is given as a SPACE-separated triple (type, start-offset, end-offset). The start-offset is the index of the first character of the annotated span in the text (".txt" file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span. Thus, the character in the end-offset position is not included in the annotated span. For reference, the text spanned by the annotation is included, separated by a TAB character.
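
The term format described above can be read with two splits: one on TAB and one on SPACE. The sketch below is illustrative only. The names `TextBound`, `parse_text_bound` and `span_matches` are hypothetical, and it assumes each term has a single contiguous span and that the .txt file has been read as text, so that offsets count characters rather than bytes.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str    # e.g. "T140"
    ann_type: str  # e.g. "Trigger"
    start: int     # index of the first character of the span
    end: int       # index of the first character after the span
    text: str      # reference copy of the covered text


def parse_text_bound(line: str) -> TextBound:
    """Parse one term line: ID <TAB> type start end <TAB> text."""
    ann_id, primary, text = line.rstrip("\n").split("\t", 2)
    ann_type, start, end = primary.split(" ")
    return TextBound(ann_id, ann_type, int(start), int(end), text)


def span_matches(ann: TextBound, document: str) -> bool:
    """Check the reference text against the document.

    The end offset is exclusive, which is exactly Python slice semantics.
    """
    return document[ann.start:ann.end] == ann.text


ann = parse_text_bound("T140\tTrigger 3973 3982\tTherefore")
# With an exclusive end offset, the span length equals end - start.
assert ann.end - ann.start == len(ann.text)
```

Calling `span_matches(ann, document)` with the contents of the matching .txt file is a quick way to confirm that the offsets are being interpreted correctly.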

In the case of events, the event ID occurs first, separated by a TAB character. The event trigger is specified as TYPE:ID and identifies the event type and its trigger through the ID. By convention, the event type is specified both in the trigger annotation and the event annotation. The event trigger is separated from the event arguments by SPACE. The event arguments are a SPACE-separated set of ROLE:ID pairs, where ROLE is one of the event- and task-specific argument roles (e.g., Effect, Cause, Theme, Site) and the ID identifies the entity or event filling that role. Note that several events can share the same trigger and that while the event trigger should be specified first, the event arguments can appear in any order.
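
An event line can be parsed in a similar way. The sketch below is a minimal illustration, not a reference parser. The function name `parse_event` is hypothetical, and arguments are kept as a list of (ROLE, ID) pairs rather than a dictionary, on the assumption that a role name could occur more than once in a line.

```python
def parse_event(line):
    """Parse one event line: ID <TAB> TYPE:TRIGGER_ID ROLE:ID ROLE:ID ...

    Returns (event_id, event_type, trigger_id, arguments), where
    arguments is a list of (role, id) pairs in file order.
    """
    event_id, body = line.rstrip("\n").split("\t", 1)
    trigger, *argument_fields = body.split()
    event_type, trigger_id = trigger.split(":", 1)
    arguments = [tuple(field.split(":", 1)) for field in argument_fields]
    return event_id, event_type, trigger_id, arguments


event_id, event_type, trigger_id, arguments = parse_event(
    "E48\tTrigger:T140 Evidence:T139 Effect:T141"
)
# Arguments may appear in any order, so look them up by role, not position.
effect_ids = [ref for role, ref in arguments if role == "Effect"]
print(event_id, event_type, trigger_id, effect_ids)
```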

An example of an annotated causal relation within the .ann file is shown below:

T139  Argument 3854 3963  Mlc is a global regulator of carbohydrate metabolism and controls several genes involved in sugar utilization
T140  Trigger 3973 3982   Therefore
T141  Argument 4008 4052  Mlc also affects the virulence of Salmonella
E48   Trigger:T140 Evidence:T139 Effect:T141
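
To read a relation such as E48 without looking up each ID by hand, the IDs in an event line can be replaced with the text of the terms they point to. The following sketch assumes TAB-delimited lines as described above; `resolve_events` and `SAMPLE` are hypothetical names, and `SAMPLE` simply restates the example with explicit TAB characters.

```python
def resolve_events(ann_text):
    """Map each event ID to a list of (role, covered text) pairs.

    The first pair of each event is its trigger. IDs that do not refer
    to a term (for example, another event) are left unresolved.
    """
    spans, events = {}, {}
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        ann_id, rest = line.split("\t", 1)
        if ann_id.startswith("T"):
            _primary, covered = rest.split("\t", 1)
            spans[ann_id] = covered
        elif ann_id.startswith("E"):
            events[ann_id] = [tuple(p.split(":", 1)) for p in rest.split()]
    return {
        event_id: [(role, spans.get(ref, ref)) for role, ref in pairs]
        for event_id, pairs in events.items()
    }


SAMPLE = (
    "T139\tArgument 3854 3963\tMlc is a global regulator of carbohydrate "
    "metabolism and controls several genes involved in sugar utilization\n"
    "T140\tTrigger 3973 3982\tTherefore\n"
    "T141\tArgument 4008 4052\tMlc also affects the virulence of Salmonella\n"
    "E48\tTrigger:T140 Evidence:T139 Effect:T141\n"
)

for role, text in resolve_events(SAMPLE)["E48"]:
    print(f"{role}: {text}")
```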