COPD corpus


The COPD corpus is a semantically annotated corpus, focussed on phenotypic information, consisting of 30 full-text articles. The corpus has been manually annotated with named entities, using a fine-grained annotation scheme, which aims to capture detailed information about COPD phenotypes. In particular, the annotations may be "nested" within each other. This is to take into account the potentially complex and nested nature of phenotype descriptions, which may include mentions of various other types of concepts within them, such as the example shown in Figure 1.

Fig. 1 - Example of a complex phenotype description that includes other concepts nested within it.

In Figure 1, the phenotype elevation of pulmonary arterial pressures is assigned the category TestOrMeasureResult, since it describes the results of carrying out a test or measurement. An analysis of the internal structure of this phenotype reveals the specific test/measure undertaken (pulmonary arterial pressures) and the anatomical entity whose pressure is being measured (pulmonary arterial).
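
One way to picture this nesting is as character-offset spans over the phrase, where one annotation is nested in another if its span lies inside the other's span. The following is a minimal Python sketch under stated assumptions, not the corpus's actual file format: the Annotation class and the annotate and nested_in helpers are hypothetical, and only TestOrMeasureResult is a category name taken from the description above. The two inner labels are placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    category: str


def annotate(text, phrase, category):
    """Build an Annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), category)


def nested_in(outer, annotations):
    """Return annotations whose span lies within outer, excluding outer itself."""
    return [a for a in annotations
            if a != outer and outer.start <= a.start and a.end <= outer.end]


text = "elevation of pulmonary arterial pressures"
phenotype = annotate(text, text, "TestOrMeasureResult")
annotations = [
    phenotype,
    # Placeholder labels: the real category names are not given in the text.
    annotate(text, "pulmonary arterial pressures", "TestOrMeasure"),
    annotate(text, "pulmonary arterial", "AnatomicalEntity"),
]

inner = nested_in(phenotype, annotations)
for a in inner:
    print(a.category, repr(text[a.start:a.end]))
```

Storing offsets rather than nested objects keeps each annotation independent, so the nesting can be recovered from span containment alone.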

The annotations in the COPD corpus correspond to both:

  • complete phrases that constitute COPD phenotypes
  • other types of concepts that are frequently mentioned within COPD phenotype phrases and/or in the context of these phenotypes.

The scheme used to annotate the COPD corpus is aimed at supporting:

  • automated location and categorisation of COPD phenotypes, e.g., those identified through tests, or those constituting risk-raising individual behaviours (such as smoking)
  • detailed investigations into the nature of these phenotypes, such as finding those that affect specific anatomical locations or those that concern different results of specific tests.
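
The second kind of investigation can be thought of as a containment query: select phenotype annotations of a given category that have a concept of another category nested inside them. The sketch below is illustrative only. The Span type, the helper functions, the toy sentence, and the labels RiskFactor and AnatomicalEntity are all assumptions made for the example; TestOrMeasureResult is the only category name taken from the description above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    category: str


def span_of(text, phrase, category):
    """Build a Span for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), category)


def contains(outer, inner):
    """True if inner lies within outer and is not the same annotation."""
    return outer != inner and outer.start <= inner.start and inner.end <= outer.end


def phenotypes_with(spans, phenotype_category, inner_category):
    """Phenotypes of one category that contain a concept of another category."""
    return [p for p in spans
            if p.category == phenotype_category
            and any(s.category == inner_category and contains(p, s) for s in spans)]


# Toy sentence invented for this example; it is not taken from the corpus.
text = "smoking and elevation of pulmonary arterial pressures"
spans = [
    span_of(text, "smoking", "RiskFactor"),                      # placeholder label
    span_of(text, "elevation of pulmonary arterial pressures",
            "TestOrMeasureResult"),
    span_of(text, "pulmonary arterial", "AnatomicalEntity"),     # placeholder label
]

hits = phenotypes_with(spans, "TestOrMeasureResult", "AnatomicalEntity")
for p in hits:
    print(p.category, repr(text[p.start:p.end]))
```

A further filter on the text of the inner span would narrow the query to one specific anatomical location.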

The COPD corpus annotations consist of the following: