PhenoCHF Download

The annotations may be downloaded for research purposes (please observe the terms of the licence below).



  • Information about annotations is provided in separate files from the text that has been annotated. The format of these annotation files is described in detail on the annotation format page.
  • The text files for the two document types in the corpus are obtained in different ways, as detailed below.
    1. Full text literature articles - These are open access papers, and the plain text files that were used as the basis for the annotation are included in the corpus download. The basename of each file name is the PMID of the associated article.
    2. Narrative EHR reports - These form part of the dataset of de-identified clinical records released as part of the i2b2 2008 Obesity Challenge (NLP Dataset #2). The dataset must be obtained individually from Partners Healthcare by signing a Data Use Agreement.
      • IMPORTANT NOTE: The i2b2 2008 Obesity Challenge Dataset is obtained as a single XML file containing all clinical records. Within the XML file, each clinical record is contained within a <doc> element, whose id attribute assigns a unique id to the record. Within each <doc> element, there is a <text> element, which contains the text of the clinical record.
        • Annotation files are provided separately for each clinical record, in the format described on the annotation format page. The basename of each annotation file corresponds to the id of the clinical record, as specified in the id attribute of the corresponding <doc> element in the original dataset file.
        • The annotation files assume that the text for each clinical record corresponds to the text that occurs between the <text> and </text> tags for the record in the original dataset file.
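
The correspondence described above can be sketched in Python as follows. This is only a minimal illustration, not a tool supplied with the corpus: the function names, the output directory and the ".txt" extension are assumptions, and it relies only on the <doc>, id and <text> names given above. Note that a standard XML parser decodes character entities (for example, &amp; becomes &), so it is worth checking a few annotation offsets against the extracted text before relying on it.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_records(source):
    """Return a dict mapping each <doc> id to the content of its <text> child.

    `source` is a file name or a binary file object holding the dataset XML.
    """
    root = ET.parse(source).getroot()
    records = {}
    # iter() finds <doc> elements at any depth, so this does not depend on
    # how the elements above <doc> are nested in the dataset file.
    for doc in root.iter("doc"):
        text_elem = doc.find("text")
        if text_elem is None:
            continue  # skip any <doc> without a <text> child
        records[doc.get("id")] = text_elem.text or ""
    return records


def write_records(records, out_dir):
    """Write one text file per record, named after the record id.

    The ".txt" extension is an assumption made for this sketch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc_id, text in records.items():
        # newline="" stops Python from translating line endings, which would
        # otherwise change character offsets on some platforms.
        with open(out / f"{doc_id}.txt", "w", encoding="utf-8", newline="") as fh:
            fh.write(text)
```

Calling write_records(extract_records("obesity_records.xml"), "ehr_text") would then place each record's text next to a name that matches the basename of its annotation file; the input file name here is hypothetical.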

PhenoCHF corpus licence

1. Copyright of Literature Articles

The full text literature articles in the PhenoCHF corpus are drawn from the PMC Open Access Subset. These articles are protected by copyright, but are made available under a Creative Commons or similar licence that generally allows more liberal redistribution and reuse than a traditional copyrighted work. Please refer to the licence of each article for its specific terms.

2. Copyright of PhenoCHF annotations

The entity mention, relation and normalisation annotations in the PhenoCHF corpus were created at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. They are licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the corpus and cite one or more of the following papers, depending on which annotations are used:

Entity Annotations

Alnazzawi, N., Thompson, P., Batista-Navarro, R. and Ananiadou, S. (2015). Using text mining techniques to extract phenotypic information from the PhenoCHF corpus. BMC Medical Informatics and Decision Making, 15(Suppl. 2): S3

Normalisation Annotations

Alnazzawi, N., Thompson, P. and Ananiadou, S. (2016). Mapping Phenotypic Information in Heterogeneous Textual Sources to a Domain-Specific Terminological Resource. PLOS ONE, 11(9): e0162287

Relation Annotations

Alnazzawi, N., Thompson, P. and Ananiadou, S. (2014). Building a semantically annotated corpus for congestive heart and renal failure from clinical records and the literature. In Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi), pp. 69-74.