COPD Corpus Annotation Format

The downloadable corpus consists of:

  • The configuration files needed to display the annotations in brat: annotation.conf, visual.conf and tools.conf (see the brat documentation for more details).
  • Three directories ("train", "dev" and "test") which contain the annotated data, split into the training, development and test sets that were used in the experiments described in the associated article. Each of the directories contains the following two types of files:
    • A set of text files (.txt), each corresponding to a paragraph in a full-text article.
    • A set of annotation files (.ann), containing the manually added annotations associated with each paragraph file.

Each text file and its associated annotation file have the same base name, which combines the article's PMCID with the paragraph number.
The naming convention is as follows: [PMCID]_[paragraphNumber]

The paragraphs are numbered consecutively, starting from 0. So, for example, the file PMC2528206_0.txt contains the text of the first paragraph of the full text in the article with the PMCID PMC2528206, while the file PMC2740954_25.ann contains the annotations associated with the 26th paragraph of the full text in the article with the PMCID PMC2740954.
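
The naming convention can be unpacked programmatically. The following is a minimal Python sketch, not part of the corpus distribution; the function name split_corpus_name and the regular expression are illustrative assumptions based only on the pattern described above.

```python
import re
from pathlib import Path

# Assumed pattern for "[PMCID]_[paragraphNumber]" plus a .txt or .ann extension.
NAME_RE = re.compile(r"^(PMC\d+)_(\d+)\.(txt|ann)$")


def split_corpus_name(filename):
    """Return (pmcid, paragraph_index) for a corpus file name.

    The paragraph index is zero-based, as in the corpus.
    """
    match = NAME_RE.match(Path(filename).name)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.group(1), int(match.group(2))


pmcid, index = split_corpus_name("train/PMC2740954_25.ann")
# index is zero-based, so index + 1 is the paragraph's position in the article.
print(pmcid, index, index + 1)
```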

Annotation file format

Annotations in the ".ann" files are encoded in the format used by the brat annotation software.

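Within each ".ann" file, each line corresponds to one of the following: an entity (NE) annotation, or a normalisation that links an entity annotation to a concept identifier.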

A sample of lines encoding entity annotations and their links to concept identifiers is shown below:

T1      AnatomicalConcept 33 42 pulmonary
N8000   Reference T1 UMLSCUI:C0024109   pulmonary
T2      Drug 191 206    corticosteroids
N11000  Reference T2 UMLSCUI:C0001617   corticosteroids
T3      MedicalCondition 92 96  COPD
N1      Reference T3 UMLSCUI:C0024117   COPD
T4      SignOrSymptom 33 55     pulmonary inflammation
N3000   Reference T4 UMLSCUI:C0032285   pulmonary inflammation
N3001   Reference T4 UMLSCUI:C3714636   pulmonary inflammation
T5      Treatment 183 206       inhaled corticosteroids
N6000   Reference T5 UMLSCUI:C0001617   inhaled corticosteroids

There are two types of lines, beginning either with "T" or with "N".

Lines beginning with "T" (NE annotations) consist of the following information:

  • A unique id for the annotation. By convention, this starts with T, followed by a numerical value.
  • A TAB character.
  • The NE type assigned to the annotation.
  • The character-based offsets of the annotated span in the corresponding text file. There are two offsets, corresponding to the start and end offsets of the span. The first offset is separated by a space from the entity type label, and there is a space between the start and end offsets.
  • Another TAB character.
  • The text covered by the annotated span in the corresponding text file.
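
The field layout of a "T" line can be read with two splits: first on TAB, then on spaces. The sketch below is illustrative only; the names Entity and parse_entity_line are assumptions, and it assumes that every span is contiguous (exactly one start and one end offset). In the sample lines, the end offset minus the start offset equals the length of the covered text, which suggests that the end offset is exclusive.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One "T" line; the field names are illustrative, not part of the format."""
    id: str
    type: str
    start: int
    end: int
    text: str


def parse_entity_line(line):
    """Parse 'ID<TAB>Type start end<TAB>covered text' into an Entity."""
    ann_id, type_and_offsets, covered = line.rstrip("\n").split("\t", 2)
    # Assumes a contiguous span: exactly one type label and two offsets.
    ne_type, start, end = type_and_offsets.split(" ")
    return Entity(ann_id, ne_type, int(start), int(end), covered)


entity = parse_entity_line("T3\tMedicalCondition 92 96\tCOPD")
print(entity)
# With the paragraph text loaded as a string, text[entity.start:entity.end]
# would be expected to equal entity.text if the end offset is exclusive.
```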

Lines beginning with "N" provide information about normalisations, i.e. links to CUIs in the UMLS Metathesaurus. They consist of the following information:

  • A unique id for the annotation. By convention, this starts with N, followed by a numerical value.
  • A TAB character.
  • The word Reference, followed by a space.
  • The id of the NE annotation to which the normalisation applies, followed by a space.
  • Information about the concept to which the NE has been normalised. This consists of:
    • The string "UMLSCUI".
    • A colon.
    • The unique concept identifier assigned to the NE within the specified resource.
  • Another TAB character.
  • The text covered by the NE to which the concept ID has been assigned.
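
An "N" line can be read in the same way as a "T" line. The following is a minimal sketch under the layout described above; the names Normalisation and parse_normalisation_line are illustrative assumptions, not part of brat or of the corpus.

```python
from typing import NamedTuple


class Normalisation(NamedTuple):
    """One "N" line; the field names are illustrative, not part of the format."""
    id: str
    target: str      # id of the NE annotation being normalised, e.g. "T1"
    resource: str    # "UMLSCUI" in this corpus
    concept_id: str  # e.g. "C0024109"
    text: str


def parse_normalisation_line(line):
    """Parse 'ID<TAB>Reference Tn UMLSCUI:CUI<TAB>covered text'."""
    ann_id, body, covered = line.rstrip("\n").split("\t", 2)
    keyword, target, reference = body.split(" ")
    if keyword != "Reference":
        raise ValueError(f"not a normalisation line: {line!r}")
    # Split only on the first colon, in case an identifier contains one.
    resource, concept_id = reference.split(":", 1)
    return Normalisation(ann_id, target, resource, concept_id, covered)


norm = parse_normalisation_line("N8000\tReference T1 UMLSCUI:C0024109\tpulmonary")
print(norm.target, norm.concept_id)
```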

Note that, in the example above, the NE with the ID T4, i.e., pulmonary inflammation, has been normalised to two separate concepts in the UMLS Metathesaurus, as denoted by the normalisation lines with the IDs N3000 and N3001.
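
Because one NE can carry several normalisations, it can be convenient to collect the CUIs per NE id. The sketch below is one possible way to do this; the function name cuis_by_entity is an illustrative assumption, and the sample string reproduces three of the lines shown earlier with explicit TAB characters.

```python
from collections import defaultdict

# Three lines from the sample above, with explicit TAB separators.
SAMPLE = "\n".join([
    "T4\tSignOrSymptom 33 55\tpulmonary inflammation",
    "N3000\tReference T4 UMLSCUI:C0032285\tpulmonary inflammation",
    "N3001\tReference T4 UMLSCUI:C3714636\tpulmonary inflammation",
])


def cuis_by_entity(ann_text):
    """Map each NE id to the list of CUIs it has been normalised to."""
    cuis = defaultdict(list)
    for line in ann_text.splitlines():
        if not line.startswith("N"):
            continue  # skip "T" lines and blank lines
        _, body, _ = line.split("\t", 2)
        _, target, reference = body.split(" ")
        cuis[target].append(reference.split(":", 1)[1])
    return dict(cuis)


print(cuis_by_entity(SAMPLE))
```

For the sample lines, the NE T4 maps to the two concept identifiers C0032285 and C3714636.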