NaCTeM

Anatomy Corpora

Anatomical entities are central to much of biomedical discourse and must be considered in any attempt to fully analyse biomedical scientific text. However, while a wealth of tools and resources has been introduced in domain natural language processing efforts for recognising mentions of molecular-level entities (genes, proteins, chemicals) and organism names in text, there has been little study of the recognition of mentions of anatomical entities such as tissues and organs.

To address this issue and to facilitate more detailed and comprehensive analysis of biomedical scientific text, our aim has been to establish a fine-grained, species-independent anatomical entity mention detection task.

We have developed a number of manually annotated corpora to support this aim:

  • Multi-Level Event Extraction (MLEE) corpus - abstracts of publications on angiogenesis, annotated with entity mentions and events across multiple levels of biological organization from the molecular to the organ system level. Over 8,000 entities with fine-grained types and over 6,000 structured events are annotated.
  • AnEM corpus - a domain- and species-independent resource, annotated with anatomical entity mentions using a fine-grained classification system. The corpus consists of 500 documents (over 90,000 words) selected randomly from citation abstracts and full-text papers with the aim of making the corpus representative of the entire available biomedical scientific literature. The corpus annotation covers mentions of both healthy and pathological anatomical entities and contains over 3,000 annotated mentions.
  • Extended Anatomical Entity Mention (AnatEM) corpus - 1,212 documents (approx. 250,000 words) annotated with over 13,000 mentions of anatomical entities. Each annotation is assigned one of 12 granularity-based types such as Cellular component, Tissue and Organ, defined with reference to the Common Anatomy Reference Ontology. The corpus builds in part on the AnEM and MLEE corpora.