
BioCause Corpus
Background
Biomedical corpora annotated with event-level information represent an important resource for domain-specific information extraction (IE) systems. However, bio-event annotation alone cannot cater for all the needs of biologists. Unlike work on relation and event extraction, most of which focusses on specific events and named entities, we aim to build a comprehensive resource, covering all statements of causal association present in discourse. Causality lies at the heart of biomedical knowledge, such as diagnosis, pathology or systems biology, and, thus, automatic causality recognition can greatly reduce the human workload by suggesting possible causal connections and aiding in the curation of pathway models. A biomedical text corpus annotated with such relations is, hence, crucial for developing and evaluating biomedical text mining.
BioCause annotation scheme
We have defined an annotation scheme that aims to enrich events in the ID event corpus (as well as other corpora annotated with biomedical events) with causality relations. This scheme has subsequently been used to annotate 851 causal relations to form BioCause, a collection of open-access full-text biomedical journal articles belonging to the subdomain of infectious diseases. These documents have been pre-annotated with named entity and event information in the context of a previous shared task, BioNLP 2011 ST ID.
The BioNLP 2011 ST ID corpus consists of 19 full-text documents that have been manually annotated with biomedical entities and events. The annotations provide classified, structured representations of relationships between biomedical terms, and as such, the corpus constitutes a valuable resource for the training of IE systems.
The original version of the ID corpus concentrated on the following:
- Identification of the event trigger (the word or phrase around which the event is organised)
- Assignment of the event type
- Identification of the event participants, usually:
  - THEME: the entity or event affected by the current event
  - CAUSE: the entity or event that causes the current event to occur
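As a rough illustration of the event structure listed above, the following Python sketch models an event as a trigger, a type and a set of role-labelled participants. The class names, field names, identifiers and the example sentence are assumptions made for illustration only; they do not reproduce the corpus's actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Trigger:
    """The word or phrase around which an event is organised."""
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


@dataclass
class Event:
    """An event: a trigger, a type, and role-labelled participants."""
    event_id: str
    event_type: str
    trigger: Trigger
    # Maps a role (THEME or CAUSE) to the id of an entity or of another event.
    participants: Dict[str, str] = field(default_factory=dict)


# Invented example sentence, not taken from the corpus.
sentence = "PhoP activates expression of pagC"

# T1 and T2 are hypothetical ids for the entities "PhoP" and "pagC".
expression = Event(
    event_id="E1",
    event_type="Gene_expression",
    trigger=Trigger(15, 25, "expression"),
    participants={"THEME": "T2"},
)
activation = Event(
    event_id="E2",
    event_type="Positive_regulation",
    trigger=Trigger(5, 14, "activates"),
    # The THEME is itself an event (E1); the CAUSE is an entity (T1).
    participants={"THEME": "E1", "CAUSE": "T1"},
)
```

Storing participant ids rather than nested objects lets an event take either an entity or another event as its THEME or CAUSE, as the list above describes.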
On top of these events, we have added causality annotations. The annotation structure is similar to that of an event:
- Identification of the causal trigger (the word or phrase around which the relation is organised). The trigger can also be empty, in which case a zero-length span is placed between the arguments.
- Identification of the relation arguments, usually:
  - CAUSE: the span of text describing the situation that causes the current event to occur
  - EFFECT: the span of text describing the situation occurring because of the current event
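The causality annotation described above can be sketched in Python as three character spans, including the case of an empty trigger. This is a minimal sketch under stated assumptions: the names `Span`, `CausalRelation` and `empty_trigger_between` are hypothetical, the example text is invented, and placing the empty trigger at the end of the first argument is an assumption, since the description only says that it lies between the arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A character span in the document text."""
    start: int  # inclusive
    end: int    # exclusive

    def is_empty(self) -> bool:
        return self.start == self.end


@dataclass(frozen=True)
class CausalRelation:
    """A causal relation: a (possibly empty) trigger plus two argument spans."""
    trigger: Span
    cause: Span
    effect: Span


def empty_trigger_between(a: Span, b: Span) -> Span:
    """Return a zero-length span lying between two non-overlapping arguments.

    Assumption: the offset chosen is the end of whichever argument comes first.
    """
    first, second = sorted((a, b), key=lambda s: s.start)
    if first.end > second.start:
        raise ValueError("argument spans overlap")
    return Span(first.end, first.end)


# Invented example text with no explicit causal trigger word.
text = "The gene was deleted. Virulence decreased."
cause = Span(0, 20)    # "The gene was deleted"
effect = Span(22, 41)  # "Virulence decreased"

relation = CausalRelation(
    trigger=empty_trigger_between(cause, effect),
    cause=cause,
    effect=effect,
)
```

Here `relation.trigger` is a zero-length span, so `relation.trigger.is_empty()` is true and the relation carries no trigger text.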
The complete version of the ID corpus enriched with causality annotation is available for download. NOTE: Please observe the terms of the BioCause corpus licence when downloading the corpus.
BioCause corpus licence
1. Copyright of abstracts
Any abstracts contained in this corpus are from PubMed(R), a database of the U.S. National Library of Medicine (NLM).
NLM data are produced by a U.S. Government agency and include works of the United States Government that are not protected by U.S. copyright law but may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. copyright law.
NLM assumes no responsibility or liability associated with use of copyrighted material, including transmitting, reproducing, redistributing, or making commercial use of the data. NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. Persons contemplating any type of transmission or reproduction of copyrighted material such as abstracts are advised to consult legal counsel.
2. Copyright of full texts
Any full texts contained in this corpus are from the PMC Open Access Subset of PubMed Central (PMC), the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature.
Articles in the PMC Open Access Subset are protected by copyright, but are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. Please refer to the license of each article for specific license terms.
3. Copyright of Named Entity and Event Annotations
See the GENIA Project License for Annotated Corpora.
4. Copyright of BioCause Annotations
Please attribute the corpus by citing the following paper:
Mihăilă, C., Ohta, T., Pyysalo, S. and Ananiadou, S. (2013) BioCause: Annotating and analysing causality in the biomedical domain. In BMC Bioinformatics, 14(1):2.