Named Entity Annotation

Annotation scheme

The annotation scheme aims to capture fine-grained details about phenotypes. It uses a detailed, hierarchically structured set of semantic labels and allows entity spans to be nested within each other, so that potential relationships between entities are captured. For example, if a treatment is mentioned within a phenotype (e.g., Steroid-induced skeletal muscle atrophy), then it is likely that the phenotype is caused or affected by the nested treatment.
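To make the idea of nested spans concrete, the following is a minimal sketch of how nested annotations could be represented and queried. The `Entity` class, the `nested_pairs` helper, and the labels assigned to the example spans are illustrative assumptions, not the corpus's actual storage format or annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A hypothetical annotation record: character offsets plus an NE label."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str
    text: str


def nested_pairs(entities):
    """Return (outer, inner) pairs where inner lies inside outer's span."""
    pairs = []
    for outer in entities:
        for inner in entities:
            if inner is outer:
                continue
            contains = outer.start <= inner.start and inner.end <= outer.end
            same_span = (outer.start, outer.end) == (inner.start, inner.end)
            if contains and not same_span:
                pairs.append((outer, inner))
    return pairs


phrase = "Steroid-induced skeletal muscle atrophy"
muscle_start = phrase.find("skeletal muscle")
muscle_end = muscle_start + len("skeletal muscle")

# Labels below are assumed for illustration only.
entities = [
    Entity(0, len(phrase), "MedicalCondition", phrase),
    Entity(0, len("Steroid"), "Treatment", phrase[:len("Steroid")]),
    Entity(muscle_start, muscle_end, "AnatomicalConcept",
           phrase[muscle_start:muscle_end]),
]

for outer, inner in nested_pairs(entities):
    print(f"{inner.label} '{inner.text}' is nested in {outer.label} '{outer.text}'")
```

A nested Treatment inside a phenotype span, as found here, is the kind of pair from which a "caused or affected by" relationship could be hypothesised.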

The hierarchy of NE labels in our scheme is shown below, color-coded according to the level in the hierarchy. For each NE annotated, annotators were instructed to assign the most specific label possible. Table 1 below provides definitions and examples of each of these categories, together with annotation counts in the final corpus.

  • Problem
    • MedicalCondition
    • RiskFactor
      • SignOrSymptom
      • IndividualBehaviour
      • TestResult
  • Treatment
  • Test
    • RadiologicalTest
    • MicrobiologicalTest
    • PhysiologicalTest
  • ConstituentConcept
    • AnatomicalConcept
    • Drug
    • Protein
    • Quality
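
The hierarchy above can be sketched as a simple child-to-parent mapping, which makes it easy to look up the more general categories of any label. The `PARENT` dictionary mirrors the list as printed; the helper names `ancestors`, `is_a`, and `depth` are hypothetical and not part of any released tooling.

```python
# Child -> parent mapping transcribed from the hierarchy listed above.
PARENT = {
    "Problem": None,
    "MedicalCondition": "Problem",
    "RiskFactor": "Problem",
    "SignOrSymptom": "RiskFactor",
    "IndividualBehaviour": "RiskFactor",
    "TestResult": "RiskFactor",
    "Treatment": None,
    "Test": None,
    "RadiologicalTest": "Test",
    "MicrobiologicalTest": "Test",
    "PhysiologicalTest": "Test",
    "ConstituentConcept": None,
    "AnatomicalConcept": "ConstituentConcept",
    "Drug": "ConstituentConcept",
    "Protein": "ConstituentConcept",
    "Quality": "ConstituentConcept",
}


def ancestors(label):
    """Return the chain of increasingly general labels above `label`."""
    chain = []
    parent = PARENT[label]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_a(label, general):
    """True if `label` equals `general` or is one of its descendants."""
    return label == general or general in ancestors(label)


def depth(label):
    """Level in the hierarchy; top-level labels have depth 0."""
    return len(ancestors(label))


print(ancestors("TestResult"))
print(is_a("Drug", "ConstituentConcept"))
print(depth("Treatment"))
```

With such a mapping, an annotation carrying a specific label such as SignOrSymptom can also be counted under its more general categories when a coarser evaluation is wanted.
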
Table 1. Types of NEs annotated

| NE Type | Description | Examples | Number of annotations |
|---|---|---|---|
| Problem | An overall category for any COPD indications of concern | frequent exacerbator | 2556 |
| MedicalCondition | Any disease or medical condition; includes COPD comorbidities | emphysema, pulmonary vascular disease, asthma | 5119 |
| RiskFactor | A phenotype signifying a patient's increased chances of having COPD | increased levels of the C-reactive protein, alpha1 antitrypsin deficiency | 1211 |
| SignOrSymptom | An observable irregularity manifested by a COPD patient | chronic cough, shortness of breath | 2065 |
| IndividualBehaviour | A patient's habits leading to susceptibility of having COPD | smoking for 25 years | 194 |
| TestResult | Findings based on COPD-relevant examinations | decrease in rate of lung function, FEV1 45% predicted | 685 |
| Treatment | Any medication, therapy or program for treating COPD | oxygen therapy, pulmonary rehabilitation | 4337 |
| Test | An overall category for any COPD-relevant examinations or measures/parameters | increased compliance of the lung, FEV1, FEV1/FVC ratio | 3576 |
| RadiologicalTest | Any of the radiological tests for detecting COPD | computed tomography scanning, high resolution computed tomography | 29 |
| MicrobiologicalTest | An examination of a COPD-relevant specimen | complete blood count | 11 |
| PhysiologicalTest | A measurement of a COPD patient's capacity to exercise | 6-min walking distance | 17 |
| ConstituentConcept | An umbrella type for elementary concepts that may form part of a phenotype description; should only be chosen if none of the subtypes below apply | breath, wheezes, air | 5 |
| AnatomicalConcept | A mention pertaining to anatomical entities | lung, heart, pulmonary, hepatic, respiratory airway | 2616 |
| Drug | Any drug name; will mostly overlap with Treatment | corticosteroids | 2593 |
| Protein | Any protein name | alpha1 antitrypsin | 820 |
| Quality | Expressions which describe any of the concepts above | chronic, obstructed, damaged, decreased rate, enhanced, decreased amount | 1153 |


The entity annotations were undertaken by annotators with domain expertise. The quality and consistency of the annotations were verified by calculating inter-annotator agreement (IAA) on six full papers. We calculated IAA in terms of F-Score under strict matching conditions (i.e., requiring both annotators' annotations to match exactly in terms of both the text span chosen and the semantic category assigned). The F-Score was 80.49.
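
As an illustration of strict matching, the sketch below computes an F-Score between two annotators' outputs, counting an annotation as agreed only when its start offset, end offset, and label are all identical. It assumes the balanced F1 measure with one annotator treated as the reference; the text does not specify these details, and the function name `strict_f_score` and the toy annotations are hypothetical.

```python
def strict_f_score(reference, candidate):
    """F1 between two annotation sets under exact span-and-label matching.

    Each annotation is a (start, end, label) tuple. Treating `reference`
    as the gold standard is an assumption; F1 is symmetric, so swapping
    the two annotators gives the same score.
    """
    ref, cand = set(reference), set(candidate)
    matches = len(ref & cand)
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
annotator_a = [
    (0, 39, "MedicalCondition"),
    (0, 7, "Treatment"),
    (16, 31, "AnatomicalConcept"),
]
annotator_b = [
    (0, 39, "MedicalCondition"),
    (0, 7, "Drug"),               # same span, different label: no match
    (16, 31, "AnatomicalConcept"),
]

score = strict_f_score(annotator_a, annotator_b)
print(f"Strict F-Score: {100 * score:.2f}")
```

Under this strict criterion, a disagreement on either the span boundaries or the label counts as both a missed annotation and a spurious one, which is why strict scores tend to be lower than scores under relaxed (overlapping-span) matching.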