NaCTeM


Multi-Level Event Extraction (MLEE)


Example annotation marking a drug ("Thalidomide") having a negative effect on tissue development ("formation of capillary tubes").

Overview

Event extraction is a popular approach to the extraction of structured information from biomedical domain texts, and manually annotated corpora are necessary for the development and evaluation of event extraction methods.

Recent biomedical domain event extraction efforts have focused almost exclusively on molecular-level entities and events. To extend the capabilities of event extraction approaches, we have created the Multi-Level Event Extraction (MLEE) corpus and tools for entity mention detection and event extraction across levels of biological organization from the molecular to the organ system level. This work has been carried out as part of the Automated Biological Event Extraction from the Literature for Drug Discovery project, which is a collaboration between NaCTeM and AstraZeneca.

Experiments using the NERsuite entity mention tagger and EventMine event extraction system indicate that the new resource allows existing methods to address multi-level event extraction at a level of performance broadly comparable to that achieved in established molecular-level extraction tasks.


Quick links

Manuscript introducing the corpus and associated resources:

The corpus, tools and associated resources:


Background

Event extraction using expressive structured representations such as that of the BioNLP Shared Task has been gaining popularity in biomedical information extraction, and multiple event extraction corpus resources, tools and automatically analysed literature databases have been introduced.

In the recent BioNLP Shared Task events in 2009 and 2011, event-annotated corpora for many subdomains of biomedical science have been introduced, including the following:

Although these corpora, and the wealth of extraction systems that task participants have introduced to address these tasks, have demonstrated that the event extraction approach is applicable to many areas of biomedical science, these previous efforts all focus on molecular-level entities and events.

To be able to present a comprehensive picture of the workings of biological systems, information extraction approaches must take into account not only the molecular-level reactions but also the cellular, tissue, and organ-level processes that produce the organism-level effects that are of primary interest in much of biomedical domain research.


Contributions

MLEE corpus

To extend the coverage of the event extraction approach to domain information extraction, we have introduced the Multi-Level Event Extraction (MLEE) corpus, consisting of manually annotated abstracts of publications on angiogenesis, the development of new blood vessels from existing ones, an area of high interest in cancer research.

The corpus annotation was created with reference to previously introduced annotation created by subdomain experts to identify spans of text that express statements relevant to their interests. To create the MLEE corpus, we have established ontological foundations for the annotation with reference to community-standard OBO Foundry resources such as the Gene Ontology (GO) and the Common Anatomy Reference Ontology (CARO), revising existing span annotations accordingly to identify over 8,000 entities with fine-grained types and introducing structured annotation for over 6,000 events.

The full MLEE corpus can be browsed online using brat (supported browsers).

To download the corpus data, see availability below.
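As a rough illustration of what the structured annotation looks like in practice, the sketch below reads text-bound (`T`) and event (`E`) lines in the standoff format used by brat. It assumes the annotations are available in that format. The sentence, the offsets, and the type labels (`Drug_or_compound`, `Negative_regulation`, `Tissue`) are made up for this example, loosely following the Thalidomide example at the top of this page, and are not taken from the corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Event:
    id: str
    type: str
    trigger: str                              # ID of the trigger annotation
    args: list = field(default_factory=list)  # (role, target ID) pairs


def parse_standoff(ann_text):
    """Parse T and E lines of brat standoff (contiguous spans only)."""
    entities, events = {}, {}
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # ID <TAB> TYPE START END <TAB> TEXT
            type_, start, end = fields[1].split(" ")
            entities[ann_id] = Entity(ann_id, type_, int(start), int(end), fields[2])
        elif ann_id.startswith("E"):
            # ID <TAB> TYPE:TRIGGER ROLE:TARGET ...
            parts = [p.split(":", 1) for p in fields[1].split()]
            (type_, trigger), args = parts[0], parts[1:]
            events[ann_id] = Event(ann_id, type_, trigger, [tuple(a) for a in args])
    return entities, events


# Illustrative data only (hypothetical sentence and labels).
TEXT = "Thalidomide inhibited the formation of capillary tubes."
SAMPLE_ANN = (
    "T1\tDrug_or_compound 0 11\tThalidomide\n"
    "T2\tNegative_regulation 12 21\tinhibited\n"
    "T3\tDevelopment 26 35\tformation\n"
    "T4\tTissue 39 54\tcapillary tubes\n"
    "E1\tDevelopment:T3 Theme:T4\n"
    "E2\tNegative_regulation:T2 Cause:T1 Theme:E1\n"
)

entities, events = parse_standoff(SAMPLE_ANN)
```

Note that the second event takes the first event as its `Theme`. This nesting is what lets an event representation express statements such as a drug having a negative effect on a developmental process.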

Entity mention detection

We used the MLEE corpus and major domain lexical and ontological resources to evaluate the feasibility of the automatic detection of mentions of entities ranging from molecules to tissues to organ systems and organisms with 16 fine-grained ontology-based types, including the following:

  • Organism
  • Organism subdivision
  • Anatomical system
  • Organ
  • Multi-tissue structure
  • Tissue
  • Cell
  • Cellular component
  • Organism substance
  • Pathological formation
  • Gene or gene product

Experiments using the state-of-the-art CRF-based entity mention detection system NERsuite demonstrated that these entities can be detected at 77% precision and 69% recall (73% F-score) using exact matching criteria and only the MLEE corpus annotations, and 85% precision and 79% recall (82% F-score) under approximate boundary matching criteria when supported by dictionaries extracted from UMLS, Entrez Gene and OBO Foundry ontology resources.
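To make the reported figures easier to interpret, the following sketch shows how precision, recall and F-score can be computed for typed spans under exact and relaxed matching. The relaxed criterion used here (same type and any character overlap) is an assumption chosen for illustration; the approximate boundary matching criterion used in the experiments may be defined differently. The spans in the example are invented.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_match(gold, pred):
    # Type, start and end must all be identical.
    return gold == pred


def overlap_match(gold, pred):
    # ASSUMPTION: "approximate" means same type and overlapping offsets.
    (g_type, g_start, g_end), (p_type, p_start, p_end) = gold, pred
    return g_type == p_type and g_start < p_end and p_start < g_end


def evaluate(gold_spans, pred_spans, match):
    """Return (precision, recall, F-score) for (type, start, end) spans."""
    # Precision: share of predicted spans that match some gold span.
    tp_pred = sum(any(match(g, p) for g in gold_spans) for p in pred_spans)
    # Recall: share of gold spans matched by some predicted span.
    tp_gold = sum(any(match(g, p) for p in pred_spans) for g in gold_spans)
    precision = tp_pred / len(pred_spans) if pred_spans else 0.0
    recall = tp_gold / len(gold_spans) if gold_spans else 0.0
    return precision, recall, f_score(precision, recall)


# Invented spans: one exact hit, one boundary error, one wrong prediction.
gold = [("Cell", 0, 17), ("Tissue", 30, 45), ("Organ", 50, 55)]
pred = [("Cell", 0, 17), ("Tissue", 34, 45), ("Organism", 60, 64)]

exact = evaluate(gold, pred, exact_match)
relaxed = evaluate(gold, pred, overlap_match)
```

In this toy case the boundary error on the `Tissue` span counts as a miss under exact matching but as a hit under the relaxed criterion. Applying `f_score` to the rounded precision and recall figures quoted above gives 73% and 82%, in agreement with the F-scores reported.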

This result compares favorably with those of domain multi-type entity mention detection evaluations such as the BioNLP/JNLPBA shared task, indicating that the entity annotations are highly consistent and that the task is feasible for existing entity mention detection methods.

The entity mention detection system is freely available under the open source MIT license from the NERsuite homepage.

For the supporting lexical resources, please see availability below.

Event extraction

We used the MLEE corpus and the established molecular-level event-annotated GE corpus to evaluate the feasibility of the automatic extraction of events involving the annotated entity types. The MLEE event annotation extends the molecular-level events targeted in previous event extraction efforts, annotating a broad-coverage set of 29 event types ranging from the molecular to the anatomy and organism level, defined primarily with reference to the Gene Ontology. The targeted events include the following (with informal scope):

  • Development (organism / anatomical structure)
  • Growth (organism / anatomical structure)
  • Remodeling (anatomical structure)
  • Breakdown (anatomical structure)
  • Death (organism / anatomical structure)
  • Cell proliferation (cell)
  • Cell division (cell)
  • Localization (cell or biomolecule)
  • Binding (cell or biomolecule)
  • Metabolism (biomolecule)
  • Synthesis (biomolecule)
  • Gene expression (biomolecule)
  • Phosphorylation (biomolecule)

Experiments using the state-of-the-art EventMine event extraction system showed that the targeted events can be extracted with 57% precision and 49% recall (52% F-score) in the evaluation setting and matching criteria of the BioNLP Shared Task when training only on the MLEE corpus. Further, we found that "stacking" with a model trained on the GE corpus data slightly improves performance, to 53% F-score, suggesting that the introduced annotations are compatible with those of major existing molecular-level event extraction resources.
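The general idea of stacking can be sketched as follows: the prediction of a model trained on one corpus is added as an extra feature for a second model trained on the target corpus. This is only a toy illustration of that idea and not the EventMine implementation. The rule-based `ge_model`, the feature names, and the `stack_features` helper are all hypothetical.

```python
def ge_model(features):
    """Stand-in for a classifier trained on the GE corpus (hypothetical rule)."""
    if features.get("trigger_lemma=express"):
        return "Gene_expression"
    return "None"


def stack_features(features, base_model):
    """Copy `features` and add the base model's predicted label as a feature."""
    stacked = dict(features)
    stacked["base_pred=" + base_model(features)] = 1.0
    return stacked


# A hypothetical trigger candidate described by sparse binary features.
candidate = {"trigger_lemma=express": 1.0, "arg_type=Gene_or_gene_product": 1.0}

# The augmented feature set would then be passed to the target-corpus model.
stacked = stack_features(candidate, ge_model)
```

Because the second model sees the first model's output only as one more feature, it can learn how far to trust that output for each of its own event types.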

The EventMine system is available as a web service from the EventMine homepage.

The GE corpus is freely available from the BioNLP Shared Task 2011 GE task homepage.

Integrated extraction system

Finally, we have created an integrated extraction system to demonstrate how the MLEE-based entity mention detection and event extraction methods can be used to create structured analyses of simple text output. This system is further integrated with the brat annotation visualization tool to create intuitive visualizations of the extracted information.
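As a minimal sketch of the final step of such a pipeline, the code below serializes extracted text-bound annotations and events into the standoff format that brat reads for visualization. The tuple layouts and the `to_standoff` helper are assumptions made for this example and do not describe the interface of the integrated system. The annotations shown are invented.

```python
def to_standoff(textbounds, events):
    """Serialize annotations as brat standoff lines.

    textbounds: (id, type, start, end, text) tuples
    events:     (id, type, trigger id, [(role, target id), ...]) tuples
    """
    lines = []
    for ann_id, type_, start, end, text in textbounds:
        lines.append(f"{ann_id}\t{type_} {start} {end}\t{text}")
    for ann_id, type_, trigger, args in events:
        arg_str = "".join(f" {role}:{target}" for role, target in args)
        lines.append(f"{ann_id}\t{type_}:{trigger}{arg_str}")
    return "\n".join(lines) + "\n"


# Hypothetical output of entity mention detection and event extraction for
# "Thalidomide inhibited the formation of capillary tubes."
textbounds = [
    ("T1", "Drug_or_compound", 0, 11, "Thalidomide"),
    ("T2", "Negative_regulation", 12, 21, "inhibited"),
    ("T3", "Development", 26, 35, "formation"),
    ("T4", "Tissue", 39, 54, "capillary tubes"),
]
events = [
    ("E1", "Development", "T3", [("Theme", "T4")]),
    ("E2", "Negative_regulation", "T2", [("Cause", "T1"), ("Theme", "E1")]),
]

ann = to_standoff(textbounds, events)
```

Saving the returned string as a `.ann` file next to a `.txt` file containing the sentence is the usual way to make annotations visible in a brat collection.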

Demo coming soon!


Availability

Corpus annotations

The MLEE corpus is available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) licence.

You should attribute the corpus by citing the paper in the References section below.

Extraction methods

  • NERsuite is freely available under the open source MIT license.
  • EventMine is available as a web service.

Supporting resources

The dictionary resources applied in this work are derived from the following database, lexical and ontological resources:

These resources are copyrighted by their creators and licensed separately. While we cannot redistribute the derived lexical resources, all of these resources are freely available for use in research from their respective providers.

(Please contact Sampo Pyysalo if you require support in the extraction of dictionaries from the various database and other file formats in which these resources are distributed.)

References

If you use the MLEE corpus or related tools in your work, please cite the following paper:


Contributors

  • Sampo Pyysalo (NaCTeM and University of Manchester): senior researcher
  • Tomoko Ohta (NaCTeM and University of Manchester): annotation coordinator
  • Makoto Miwa (NaCTeM and University of Manchester): event extraction (EventMine)
  • Han-Cheol Cho (University of Tokyo and Tohoku University): entity mention recognition (NERsuite)
  • Jun'ichi Tsujii (Microsoft Research Asia): principal researcher
  • Sophia Ananiadou (NaCTeM and University of Manchester): principal investigator

Acknowledgments

This work is funded by the UK Biotechnology and Biological Sciences Research Council (BBSRC) under the project Automated Biological Event Extraction from the Literature for Drug Discovery (reference number: BB/G013160/1).