What makes a knowledge resource, like a domain model, thesaurus, or ontology, effective for term-level text annotation in the Biology domain? In this work we compare several approaches to ontology design with examples from well-known resources such as OBO, MeSH, and Genia. Based on these comparisons we establish goals for a knowledge resource that supports term-level text annotation and text mining: such a resource should represent terms and relations that are expressed in contiguous spans of text, and its terms should bear meaningful correspondences with other knowledge resources. Finally, we trace how these two goals have affected the re-design of the Genia Ontology over several iterations. The result is a new term hierarchy and a new design process, both specifically tailored to term-level text annotation. This research explores practical influences on the design of knowledge resources for Bio-NLP systems.
We report on two large corpora of semantically annotated full-text biomedical research papers created in order to develop information extraction (IE) tools for the TXM project. Both corpora have been annotated with a range of entities (CellLine, Complex, DevelopmentalStage, Disease, DrugCompound, ExperimentalMethod, Fragment, Fusion, GOMOP, Gene, Modification, mRNAcDNA, Mutant, Protein, Tissue), normalisations of selected entities to the NCBI Taxonomy, RefSeq, EntrezGene, ChEBI and MeSH, and enriched relations (protein-protein interactions, tissue expressions and fragment- or mutant-protein relations). While one corpus targets protein-protein interactions (PPIs), the focus of the other is on tissue expressions (TEs). This paper describes the selected markables and the annotation process of the ITI TXM corpora, and provides a detailed breakdown of the inter-annotator agreement (IAA).
A significant amount of important information in Electronic Health Records (EHRs) is often found only in the unstructured part of patient narratives, making it difficult to process and utilize for tasks such as evidence-based health care or clinical research. In this paper we describe the work carried out in the CLEF project for the semantic annotation of a corpus to assist in the development and evaluation of an Information Extraction (IE) system as part of a larger framework for the capture, integration and presentation of clinical information. The CLEF corpus consists of both structured records and free text documents from the Royal Marsden Hospital pertaining to deceased cancer patients. The free text documents are of three types: clinical narratives, radiology reports and histopathology reports. A subset of the corpus has been selected for semantic annotation and two annotation schemes have been created and used to annotate: (i) a set of clinical entities and the relations between them, and (ii) a set of annotations for time expressions and their temporal relations with the clinical entities in the text. The paper describes the make-up of the annotated corpus, the semantic annotation schemes used to annotate it, details of the annotation process and of inter-annotator agreement studies, and how the annotated corpus is being used for developing supervised machine learning models for IE tasks.
The accurate recognition of modal information is vital for the correct interpretation of statements. In this paper, we report on the collection of words and phrases that express modal information in biomedical texts, and propose a categorisation scheme according to the type of information conveyed. We have performed a small pilot study through the annotation of 202 MEDLINE abstracts according to our proposed scheme. Our initial results suggest that modality in biomedical statements can be predicted fairly reliably through the presence of particular lexical items, together with a small amount of contextual information.
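The idea of predicting modality from lexical items plus a small amount of context can be illustrated with a minimal sketch. The cue list, the two category names ("speculative", "certain"), the negator set, and the one-token context window are all hypothetical choices made for this illustration; they are not the word collection or the categorisation scheme proposed in the abstract above.

```python
import re

# Hypothetical cue lexicon: surface form -> illustrative category.
# The actual scheme and word list are not given in the abstract.
CUES = {
    "suggest": "speculative", "suggests": "speculative", "suggested": "speculative",
    "may": "speculative", "might": "speculative",
    "show": "certain", "shows": "certain", "showed": "certain", "shown": "certain",
    "demonstrate": "certain", "demonstrates": "certain", "demonstrated": "certain",
}

# Hypothetical "small amount of context": a negator directly before the cue.
NEGATORS = {"not", "no", "never"}


def modal_cues(sentence):
    """Return (cue, category, negated) for every lexical cue in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    found = []
    for i, token in enumerate(tokens):
        category = CUES.get(token)
        if category is None:
            continue
        negated = i > 0 and tokens[i - 1] in NEGATORS
        found.append((token, category, negated))
    return found


if __name__ == "__main__":
    print(modal_cues("These results suggest that X may not bind Y."))
    print(modal_cues("We did not show binding."))
```

A lookup of this kind only detects cue presence; deciding which statement a cue modifies would need more context than this sketch uses.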
Building a large lexical resource which pools together terms of different semantic types naturally leads to the consideration of term ambiguity issues. Both cross- and intra-domain polysemy constitute a formidable obstacle in tasks and applications such as Named Entity Recognition or Information Retrieval (e.g. query expansion or relevance feedback), where a single polysemous term may cause a significant concept drift. One of the biggest sources of polysemy in biomedical terminologies is protein and gene names (PGN), both those extracted from existing databases and their variants found in the literature. We provide an analysis where the effect of using static dictionary features for the detection of potential polysemy of protein and gene names (and by extension other semantic types) is clearly delineated from the contribution of applying contextual features. We argue that, although disambiguation based on static dictionary features does not outperform fully-fledged context-driven Named Entity Recognition, it does effectively filter out highly polysemous terms (increase in F-measure from 0.06 to 0.57 and from 0.21 to 0.51 as measured on two evaluation corpora). Moreover, static dictionary features are context-independent and thus more easily applicable in systems where running intense on-the-fly disambiguation for retrieved documents could be problematic.
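One way to picture a static, context-independent dictionary feature is to count how many semantic-type dictionaries list a term and to drop terms listed in too many. The sketch below assumes exactly that; the dictionary names, their contents, the `max_types` threshold, and the helper names are hypothetical and do not reproduce the features or data used in the abstract above. The F-measure helper uses the standard balanced definition.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def filter_polysemous(terms, dictionaries, max_types=1):
    """Keep terms that occur in at most `max_types` dictionaries.

    `dictionaries` maps a semantic-type name to a set of lowercased entries.
    No sentence context is consulted, so the filter can be precomputed once.
    """
    kept = []
    for term in terms:
        key = term.lower()
        types = {name for name, entries in dictionaries.items() if key in entries}
        if len(types) <= max_types:
            kept.append(term)
    return kept


if __name__ == "__main__":
    # Hypothetical toy dictionaries, for illustration only.
    dictionaries = {
        "protein_gene": {"brca1", "cat", "was"},
        "common_english": {"cat", "was"},
        "disease": {"was"},
    }
    print(filter_polysemous(["BRCA1", "CAT", "WAS"], dictionaries))
    print(round(f_measure(tp=1, fp=0, fn=1), 3))
```

Because the filter depends only on dictionary membership, it can be applied to a term list ahead of time rather than to each retrieved document.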
In this paper we investigate the manual subclassification of chemical named entities into subtypes representing whole compounds, parts of compounds and classes of compounds. We present a set of detailed annotation guidelines, and demonstrate their reproducibility by performing an inter-annotator agreement study on a set of 42 chemistry papers. The accuracy and kappa for annotating the subtypes of the majority named entity type were 86.0% and 0.784 respectively, indicating that consistent manual annotation of these phenomena is possible. Finally, we present a simple system that can make these judgments with accuracy of 67.4% and kappa of 0.470.
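Accuracy and kappa figures of the kind reported above can be computed from two parallel label sequences. The sketch below assumes Cohen's kappa for two annotators; the abstract does not say which kappa variant was used, and the label names in the usage example are hypothetical.

```python
from collections import Counter


def agreement(labels_a, labels_b):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so perfect agreement is reported by convention in this sketch.
        return observed, 1.0
    return observed, (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical subtype labels for four entities.
    annotator_1 = ["whole", "whole", "part", "class"]
    annotator_2 = ["whole", "part", "part", "class"]
    accuracy, kappa = agreement(annotator_1, annotator_2)
    print(f"accuracy={accuracy:.3f} kappa={kappa:.3f}")
```

Kappa is lower than raw accuracy because it discounts the agreement that two annotators would reach by chance given how often each uses every label.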
Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. The recognition of these named entities relies on appropriate dictionary resources as well as on training and evaluation corpora. In this work we give an overview of publicly available chemical information resources with respect to chemical terminology. The coverage, number of synonyms, and especially the inclusion of SMILES or InChI are considered. Normalization of different chemical names to a unique structure is only possible with these structure representations. In addition, the generation and annotation of training and testing corpora is presented. We describe a small corpus for the evaluation of dictionaries containing chemical entities as well as a training and test corpus for the recognition of IUPAC and IUPAC-like names, which cannot be fully enumerated in dictionaries. Corpora can be found at http://www.scai.fraunhofer.de/chem-corpora.html.
Human anatomy knowledge is an integral part of radiological information, which is necessary for image annotation in a semantic cross-modal image and information retrieval scenario. Anatomy and radiology related concepts and relations can be discovered from an anatomy corpus, which can be built from Wikipedia as reported here. An ontology of human anatomy and a controlled vocabulary for radiology are used as knowledge resources in the search for significant concepts and relations. Our ultimate goal is to use the concepts and the relationships discovered in this way to identify potential query patterns. These query patterns are abstractions of the actual queries that radiologists and clinicians would typically pose to a semantic search engine to find patient-specific sets of relevant images and textual data.