PHAEDRA corpus

Description

The PHAEDRA corpus is a semantically annotated corpus for pharmacovigilance (PV), consisting of 597 MEDLINE abstracts. Its multiple, fine-grained levels of annotation, added by domain experts, make it a unique resource within the field. They are intended to encourage the development or adaptation of novel machine learning tools for extracting PV-related information from text. Such tools should in turn help curators to efficiently increase the coverage, consistency and completeness of the information in PV resources.

The corpus includes five different levels of information, which allow detailed information about drug effects to be encoded.

  • Named entities that participate in the description of drug effects.
    • Some categories of named entities have been linked to concept identifiers in domain-specific terminological resources by an automatic normalisation method.
  • Events that encode descriptions of drug effects.
  • Interpretative attributes assigned to events to denote whether the event is negated and/or speculated, and to indicate the severity of the interaction/effect.
  • Binary relations between NEs, to encode enriched descriptions of certain event participants.
  • Coreference relations between certain NEs, to allow the interpretation of event participants that are not fully described within the scope of the event-containing sentence.
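
As a rough illustration of how the five levels listed above fit together, the following Python sketch models them as plain data classes. It is not the corpus's distribution format or an official loader: every class name, field name and label ("Drug", "Disorder", "Adverse_effect", the role names and the example sentence) is a hypothetical choice made for this example and may not match the labels used in the corpus or its guidelines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class NamedEntity:
    """Level 1: a named entity, optionally linked to a concept identifier."""
    id: str
    category: str                     # hypothetical label, e.g. "Drug"
    span: Tuple[int, int]             # character offsets into the abstract text
    concept_id: Optional[str] = None  # only some categories are normalised


@dataclass
class Event:
    """Levels 2 and 3: an event plus its interpretative attributes."""
    id: str
    event_type: str                   # hypothetical label
    trigger_span: Tuple[int, int]
    participants: Dict[str, str]      # hypothetical role name -> entity id
    negated: bool = False
    speculated: bool = False
    severity: Optional[str] = None


@dataclass
class Relation:
    """Level 4: a binary relation between two named entities."""
    id: str
    relation_type: str
    source: str                       # entity id
    target: str                       # entity id


@dataclass
class AnnotatedAbstract:
    text: str
    entities: Dict[str, NamedEntity] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)
    # Level 5: coreference links as (anaphor entity id, antecedent entity id)
    coreferences: List[Tuple[str, str]] = field(default_factory=list)

    def asserted_events(self) -> List[Event]:
        """Return events that are neither negated nor speculated."""
        return [e for e in self.events if not e.negated and not e.speculated]


def build_example() -> AnnotatedAbstract:
    """Build a tiny invented document; the sentence is not from the corpus."""
    doc = AnnotatedAbstract(text="Aspirin did not cause bleeding.")
    doc.entities["T1"] = NamedEntity("T1", "Drug", (0, 7))
    doc.entities["T2"] = NamedEntity("T2", "Disorder", (22, 30))
    doc.events.append(
        Event(
            id="E1",
            event_type="Adverse_effect",
            trigger_span=(16, 21),
            participants={"has_agent": "T1", "affects": "T2"},
            negated=True,
        )
    )
    return doc


if __name__ == "__main__":
    example = build_example()
    for entity in example.entities.values():
        start, end = entity.span
        print(entity.id, entity.category, example.text[start:end])
    print("asserted events:", len(example.asserted_events()))
```

Keeping the interpretative attributes on the event itself makes filters such as `asserted_events` straightforward, which matters for PV curation because a negated or speculated drug effect should not be recorded as an observed one.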

Availability

The PHAEDRA corpus may be downloaded or visualised online. The annotation guidelines are also available for download.

If you use either the corpus or the guidelines, please observe the terms of the licence, as described below.

Licence

The annotations in the PHAEDRA corpus were created at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. They are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

PLEASE ATTRIBUTE NaCTeM WHEN USING THE CORPUS, AND PLEASE CITE THE FOLLOWING ARTICLE:

Paul Thompson, Sophia Daikou, Kenju Ueno, Riza Batista-Navarro, Jun'ichi Tsujii and Sophia Ananiadou (2018). Annotation and Detection of Drug Effects in Text for Pharmacovigilance. Journal of Cheminformatics, 10:37.

Motivation

Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual data from various sources can provide curators of PV databases with important evidence about the usage and effects of drug targets in different medical subjects. However, locating relevant evidence efficiently can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts, who follow annotation guidelines to ensure consistency. We have developed a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions. Its quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement. To our knowledge, the corpus is unique in the domain of PV in terms of the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically.