The Alchemy of Annotation: When Biologists Disagree


Ewan Klein
(School of Informatics, University of Edinburgh)

Many current approaches to text processing for the biomedical literature involve some kind of manual annotation of text. This is necessary for techniques that rely on supervised machine learning, and even approaches that manage without training data typically require manually annotated "gold standard" test data. The quality of the manually annotated data is a crucial factor both in the quality of the models learned from training data and in the accuracy with which systems are measured against test data. This quality is usually assessed in terms of Inter-annotator Agreement (IAA). Within the research literature on biomedical text mining, IAA for named entities is frequently reported and is usually fairly good (though lower than for newswire). By contrast, IAA for relation extraction is less frequently reported and, when it is reported, tends to be considerably lower than for named entities.

This paper will report on the IAA obtained from an exercise in which four biologists annotated 750 abstracts and 150 full-text papers for protein-protein interactions. A sample of 5% of these documents was doubly annotated. In addition, 16 documents were marked up by all four annotators in an initial training phase. We present an analysis of where the annotators disagreed, both on entities and on relations between those entities. Although a variety of factors conspire to lower agreement for relations, we will show that most of them involve an interplay between ontological judgements by the biologist, linguistic characteristics of the text, the clarity of the annotation guidelines, and ergonomic aspects of the annotation task. Although annotating relations is intrinsically hard, we believe that an analysis of this sort can point to ways of improving IAA.
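
As an illustration of how agreement on doubly annotated documents might be quantified, the following is a minimal sketch only. The abstract does not state which IAA measure was used; the sketch assumes a pairwise F1 score over exact-match relation tuples, which is one common choice in this area. The function name `pairwise_f1`, the tuple layout (document, first protein, second protein), and the sample data are all hypothetical.

```python
def pairwise_f1(annotations_a, annotations_b):
    """F1 agreement between two annotators' sets of annotations.

    Annotator A is treated as the reference and B as the response;
    the resulting F1 score is symmetric in A and B.
    """
    a, b = set(annotations_a), set(annotations_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: trivial agreement
    matched = len(a & b)  # annotations on which both annotators agree exactly
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical relation annotations: (document id, protein 1, protein 2)
annotator_1 = {
    ("doc1", "ProtA", "ProtB"),
    ("doc1", "ProtC", "ProtD"),
    ("doc2", "ProtE", "ProtF"),
}
annotator_2 = {
    ("doc1", "ProtA", "ProtB"),
    ("doc2", "ProtE", "ProtG"),  # disagrees on the second entity
}

score = pairwise_f1(annotator_1, annotator_2)
print(f"Pairwise F1 agreement: {score:.2f}")
```

Because a relation counts as matched only when every component agrees, a single entity disagreement removes the whole relation from the matched set, which is one reason relation IAA tends to fall below entity IAA.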