PhenoCHF Annotation Format

The downloadable corpus consists of:

  • A set of annotation files, containing the manually-added annotations associated with each document file.
  • A set of text files corresponding to the literature articles only.
    NOTE: The text files for the narrative EHR reports are part of the corpus of de-identified clinical records released for the i2b2 2008 Obesity Challenge (NLP Dataset #2). That dataset must be obtained separately from Partners Healthcare by signing a Data Use Agreement.
The annotation files for the literature articles and their associated text files are contained within the subdirectory Articles, while the annotation files corresponding to the narrative EHR reports are contained within the subdirectory NarrativeEHR.

Literature Articles

For literature articles, the text file and its associated annotation files share the same base name, which is the PMID of the article.

Narrative EHR reports

The i2b2 2008 Obesity Challenge Dataset is obtained as a single XML file, containing all clinical records. Within the XML file, each document is contained within a <doc> element, and the doc element has an id attribute, which assigns a unique id to each clinical record. Within each doc element, there is a <text> element, which contains the text of the clinical record.

  • Annotation files are provided separately for each clinical record. The base name of the annotation files corresponds to the id of the clinical record, as specified in the id attribute of the corresponding <doc> element in the original dataset file.
  • The annotation files assume that the text for each clinical record corresponds to the text that occurs between the <text> and </text> tags for the record in the original dataset file.
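As a rough illustration of the two points above, the following Python sketch collects the text of each clinical record from the dataset file, keyed by record id. It relies only on the <doc>, id and <text> structure described here. The function name load_record_texts and the wrapper elements in the sample string are hypothetical. It has also not been verified that an XML parser's handling of whitespace and character entities reproduces the exact character positions assumed by the annotation offsets, so the result should be spot-checked against a few annotations.

```python
import xml.etree.ElementTree as ET


def load_record_texts(xml_source):
    """Map each record id to the content of its <text> element."""
    root = ET.fromstring(xml_source)
    texts = {}
    # iter("doc") finds <doc> elements at any depth below the root.
    for doc in root.iter("doc"):
        text_el = doc.find("text")
        if text_el is not None:
            texts[doc.get("id")] = text_el.text or ""
    return texts


# Hypothetical miniature input; the real dataset file is much larger.
sample = (
    '<root><docs><doc id="12">'
    "<text>Lungs: bilateral crackles.</text>"
    "</doc></docs></root>"
)
texts = load_record_texts(sample)
print(texts["12"][0:5])
```

The annotation file for a record could then be looked up in NarrativeEHR by using the dictionary key as the base name.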

Annotation file formats

Annotations are encoded in the BioNLP Shared Task 2013 format, with some custom additions to allow normalisation annotations to be encoded. Based on this format, there are two annotation files associated with each text file:

  • a1 files - encode information about entity annotations, polarity cues and normalisation annotations
  • a2 files - encode information about relation annotations.

a1 Files

In a1 files, each line corresponds to an annotation. There are two formats of lines, depending on whether they encode an entity mention annotation or a normalisation. The format of each type of line is described below:

Entity mention annotations

Entity mention annotations encode the text spans corresponding to phenotype concept mentions (or polarity cues for Negate relations, see below), and assign a semantic label, according to the type of concept being mentioned.

A sample of lines encoding entity annotations is shown below:

T1	Cause 128 151	coronary artery disease
T5	NontradRF 285 291	anemia
T6	SignOrSymptom 393 412	shortness of breath
T2	RiskFactor 211 233	deep venous thrombosis
T8	SignOrSymptom 451 469	bilateral crackles
T9	Organ 440 445	Lungs
T10	RiskFactor 6272 6281;6282 6290

Each line that encodes an entity mention consists of the following information:

  • A unique id for the entity. By convention, this starts with T, followed by a numerical value.
  • A TAB character.
  • The concept type label assigned to the annotation (or PolCue for words or phrases that denote negation, i.e., polarity cues). The labels corresponding to each concept type are shown in Table 1.
  • The character-based offsets of the entity annotation in the corresponding text file. There are two formats for the offsets, depending on whether the annotated span consists of a single, continuous span or a discontinuous span, consisting of multiple, connected spans. A discontinuous span may occur, for example, when an entity mention is broken over two lines.
    • For continuous spans (as in the first 6 lines in the sample above), there are two offsets, corresponding to the start and end offsets of the span. The first offset is separated by a space from the entity type label, and there is a space between the start and end offsets.
    • For discontinuous spans (as in the final line of the sample above), there are two or more pairs of start and end offsets, with the pairs separated by semicolons. As with continuous spans, a space separates the start and end offsets within each pair. Each pair of offsets corresponds to a part of the complete annotated span.
  • Another TAB character.
  • The text covered by the annotated span in the corresponding text file.
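The field layout described above can be read with a few string splits. The Python sketch below is one possible reading of the format, not a reference parser: the names Entity and parse_entity_line are hypothetical, and a line without a covered-text field (such as the last sample line) is assumed to yield an empty string. In the continuous sample lines, the end offset minus the start offset equals the length of the covered text, which is consistent with the end offset being exclusive.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per span fragment
    text: str


def parse_entity_line(line: str) -> Entity:
    """Parse one entity mention line from an a1 file."""
    fields = line.rstrip("\n").split("\t")
    entity_id = fields[0]
    # The second field holds the label, a space, then the offsets.
    label, _, offsets = fields[1].partition(" ")
    spans = []
    for pair in offsets.split(";"):  # discontinuous spans use ';'
        start, end = pair.split()
        spans.append((int(start), int(end)))
    # Assumption: a missing covered-text field becomes an empty string.
    text = fields[2] if len(fields) > 2 else ""
    return Entity(entity_id, label, spans, text)


entity = parse_entity_line("T6\tSignOrSymptom 393 412\tshortness of breath")
print(entity.label, entity.spans, entity.text)
```

Given the document text as a string, `doc_text[start:end]` for a continuous span should then return the covered text, assuming the offsets index the text file as Python indexes a decoded string.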

Table 1 provides the labels used for each concept type.

Table 1. Labels used in annotation files for each concept type or polarity cue
Concept type                   Label used in annotation file
Risk Factor                    RiskFactor
Sign & Symptom                 SignOrSymptom
Non-traditional risk factor    NontradRF
Polarity Cue                   PolCue
Chief Complaint

Normalisation annotations

The normalisation annotations provide a mapping between each entity mention annotation and the identifier for a concept in the UMLS Metathesaurus (i.e., a UMLS CUI).

A sample of lines encoding normalisation annotations is shown below:

#1	UMLS_CUI T1	C1956346
#2	UMLS_CUI T5	C0002871
#3	UMLS_CUI T6	C0013404
#4	UMLS_CUI T2	C0149871
#5	UMLS_CUI T8	C2071429

The format of these lines is as follows:

  • A unique numeric identifier for the normalisation annotation. This is preceded by a hash character (#).
  • A TAB character.
  • The string "UMLS_CUI", followed by a space.
  • The identifier of the entity mention annotation to which the UMLS CUI has been assigned.
  • A TAB character.
  • The UMLS CUI that represents the concept described by the entity mention.
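To show how these fields fit together, the Python sketch below builds a mapping from entity mention ids to UMLS CUIs. The function names parse_normalisation_line and cui_by_entity are hypothetical. The sketch assumes that normalisation lines are the only lines in an a1 file that start with a hash character, which matches the samples shown here but has not been checked against the whole corpus.

```python
def parse_normalisation_line(line):
    """Split one normalisation line into (annotation id, entity id, CUI)."""
    fields = line.rstrip("\n").split("\t")
    annotation_id = fields[0]              # e.g. "#1"
    marker, entity_id = fields[1].split()  # "UMLS_CUI" and e.g. "T1"
    if marker != "UMLS_CUI":
        raise ValueError(f"unexpected marker: {marker!r}")
    return annotation_id, entity_id, fields[2].strip()


def cui_by_entity(lines):
    """Build an entity id -> CUI mapping from the lines of an a1 file."""
    mapping = {}
    for line in lines:
        # Assumption: only normalisation lines start with '#'.
        if line.startswith("#"):
            _, entity_id, cui = parse_normalisation_line(line)
            mapping[entity_id] = cui
    return mapping


a1_lines = [
    "T1\tCause 128 151\tcoronary artery disease",
    "T5\tNontradRF 285 291\tanemia",
    "#1\tUMLS_CUI T1\tC1956346",
    "#2\tUMLS_CUI T5\tC0002871",
]
print(cui_by_entity(a1_lines))
```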

a2 Files

In a2 files, each line corresponds to a relation annotation.

Relation annotations have the following format:

R12	Causality Arg1:T18 Arg2:T17
R25	Finding Arg1:T64 Arg2:T66
R13	Negate Arg1:T41 Arg2:T37

Each line consists of:

  • A unique id for the relation annotation. By convention, this starts with R, followed by a numerical value.
  • A TAB character.
  • The relation type label assigned to the annotation. This is either Causality, Finding or Negate.
  • Details of the two text spans that are linked in the relation.
    • In the case of Causality and Finding relations, both text spans correspond to entity mentions.
    • In the case of Negate relations, the first of the text spans is a polarity cue for negation, while the second is an entity mention.
  • Each text span that is linked in a relation annotation is referred to as an argument. The first argument is denoted by the label Arg1 and the second argument is denoted by the label Arg2. In each case, the argument label is followed by a colon, and then by the ID of the corresponding text span (which corresponds to one of the T annotations introduced above).
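A relation line can be split in a similar way. The Python sketch below is a minimal reading of the layout described above. The names Relation and parse_relation_line are hypothetical, and the space separators between the type label and the arguments are inferred from the sample lines rather than stated in the format description.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    id: str
    rel_type: str
    arg1: str  # for Negate relations, the id of the polarity cue
    arg2: str  # the id of an entity mention


def parse_relation_line(line: str) -> Relation:
    """Parse one relation line from an a2 file."""
    # strip() also discards any trailing TAB or newline.
    relation_id, body = line.strip().split("\t", 1)
    rel_type, *args = body.split()
    # Each argument has the form "ArgN:Tid".
    roles = dict(arg.split(":", 1) for arg in args)
    return Relation(relation_id, rel_type, roles["Arg1"], roles["Arg2"])


relation = parse_relation_line("R13\tNegate Arg1:T41 Arg2:T37")
print(relation.rel_type, relation.arg1, relation.arg2)
```

The arg1 and arg2 values are T ids, so they can be resolved against the entity mention lines of the a1 file that shares the same base name.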