Entity normalisation

HYPHEN Method

We have applied an automatic normalisation method (HYPHEN) to link our annotated named entities to UMLS concept identifiers (CUIs), wherever possible.

HYPHEN is a hybrid method that employs a pipeline of techniques to generate variants of the original NE mention (mostly systematic syntactic and semantic variations) and tries to match the generated variants against the variants listed in the target terminological resources. HYPHEN employs the following six individual techniques:

  1. Acronym and abbreviation expansion and context-sensitive disambiguation (e.g., Type 2 DM -> Type 2 diabetes mellitus)
  2. Conversion of plural forms to singular (e.g., alveolar septa -> alveolar septum)
  3. Generation of English equivalents of Neoclassical compounds (e.g., elevated blood leukocyte counts -> elevated white blood cell counts)
  4. Generation of Neoclassical equivalents of English terms (e.g., pleural inflammation -> pleuritis)
  5. Generation of syntactic variants (e.g., supplemental oxygen -> oxygen supplementation)
  6. Generation of synonyms (e.g., worsening pulmonary function -> deterioration of lung function)

The techniques are applied in the order shown above, which was chosen on the basis of experiments to determine the optimal ordering. The output of each technique is passed as input to the next technique in the sequence. This can increase the accuracy of normalisation, since multiple transformations are sometimes necessary before a mention can be mapped to the terminological resource, e.g., hypertensive eyes -> (singular) -> hypertensive eye -> (syntactic) -> eye hypertension -> (Neoclassical) -> ocular hypertension [UMLS: C0028840]. The pipeline terminates as soon as one of the techniques generates a variant that can be matched in the terminological resource.
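The chaining and early-termination behaviour described above can be sketched in a few lines of Python. This is only an illustrative sketch, not HYPHEN's actual implementation: the function names (singularise, syntactic_variant, neoclassical, normalise), the one-entry TERMINOLOGY dictionary and the crude string rules are all assumptions made for the example, and the three toy stages simply replay the worked "hypertensive eyes" example rather than the full six-technique pipeline.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for the terminological resource (variant -> CUI).
# The single entry is taken from the worked example in the text.
TERMINOLOGY: Dict[str, str] = {"ocular hypertension": "C0028840"}


def singularise(term: str) -> str:
    """Toy plural-to-singular rule: strip one trailing 's'."""
    return term[:-1] if term.endswith("s") else term


def syntactic_variant(term: str) -> str:
    """Toy syntactic rewrite: 'hypertensive X' becomes 'X hypertension'."""
    words = term.split()
    if len(words) == 2 and words[0] == "hypertensive":
        return f"{words[1]} hypertension"
    return term


def neoclassical(term: str) -> str:
    """Toy English-to-Neoclassical substitution: 'eye' becomes 'ocular'."""
    return term.replace("eye", "ocular")


def normalise(
    mention: str,
    techniques: List[Callable[[str], str]],
    terminology: Dict[str, str],
) -> Tuple[Optional[str], List[str]]:
    """Feed each technique's output into the next and stop at the first match.

    Returns the matched CUI (or None) and the list of variants generated.
    """
    current = mention.lower()
    trace = [current]
    if current in terminology:
        return terminology[current], trace
    for technique in techniques:
        variant = technique(current)
        if variant == current:
            continue  # this technique produced nothing new
        current = variant
        trace.append(current)
        if current in terminology:
            return terminology[current], trace  # terminate the pipeline early
    return None, trace


cui, trace = normalise(
    "hypertensive eyes",
    [singularise, syntactic_variant, neoclassical],
    TERMINOLOGY,
)
print(cui, trace)
```

In this sketch each technique returns a single variant; a real system would more plausibly generate several candidate variants per step and test each of them against the resource.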

Thompson, P. and Ananiadou, S. (2018). HYPHEN: A flexible, hybrid method to map phenotype concept mentions to terminological resources. Terminology, 24(1), 91-121.

Statistics of applying HYPHEN to the annotations of the COPD Corpus

Table 1 reports the number of NE annotations in the COPD corpus to which HYPHEN is able to assign UMLS CUIs. As the table shows, HYPHEN normalises a large percentage of the entities in the majority of categories in our corpus.

Table 1. Number of NEs in the COPD corpus normalised to UMLS concepts by HYPHEN
Category             Total NEs  # NEs normalised  % NEs normalised
Problem                   2556              2151             83.15
MedicalCondition          5119              4969             97.07
RiskFactor                1211               942             77.79
SignOrSymptom             2065              1140             55.21
IndividualBehaviour        194               124             63.92
TestOrMeasureResult        685               259             37.81
Treatment                 4337              3775             87.04
TestOrMeasure             3576              2609             72.96
AnatomicalConcept         2616              2372             90.67
Drug                      2593              2368             91.32
Protein                    820               727             87.66
Quality                   1153              1015             88.03

Some examples from the corpus of successful normalisations by HYPHEN are shown in Table 2.

Table 2. Sample normalisation results
Entity annotation       Semantic Category  Mapped UMLS concept
increased PVR           Problem            increased pulmonary vascular resistance (C1867423)
lung failure            MedicalCondition   pulmonary failure (C0948755)
left atrial             AnatomicalConcept  left atrium (C0225860)
arm training            Treatment          upper limb training (C0556501)
spirometric test        TestOrMeasure      spirometry test (C0037981)
genetic predisposition  RiskFactor         genetic susceptibility to disease (C1455997)

Visualisation of concept IDs in brat

When visualising the annotated corpus, information about NEs that have been automatically linked to concept IDs by HYPHEN can be viewed by hovering the mouse over the relevant entity. If the entity has been normalised, the following information is displayed:

  • The CUI assigned to the entity, preceded by the label UMLSCUI. It appears right-aligned underneath the horizontal line.
  • A preferred/heading term for the concept, preceded by the label UMLS in bold.

Note that some entities are normalised to more than one UMLS concept. In this case, the information for each of the CUIs to which the entity is mapped is shown.

For example, Figure 1 shows that the IndividualBehaviour annotation never smokers has been linked to the UMLS concept Never smoked tobacco with CUI C0425293, while in Figure 2, the Problem annotation AE-COPD has been linked to the UMLS concept Acute exacerbation of chronic obstructive airways disease with CUI C0340044.


Figure 1: Display of concept ID information for an IndividualBehaviour annotation


Figure 2: Display of concept ID information for a Problem annotation
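The same concept links can also be read programmatically rather than by hovering in brat. The sketch below assumes that the links are stored as brat standoff normalisation annotations (lines whose ID starts with N, of the form "N1<TAB>Reference T1 RESOURCE:ID<TAB>label"); whether the distributed COPD corpus files use this representation, and the resource key UMLS used in the sample, are assumptions. The helper name parse_normalisations, the Normalisation tuple and the sample text are hypothetical; the sample reuses the Figure 1 example with illustrative character offsets.

```python
from typing import List, NamedTuple


class Normalisation(NamedTuple):
    ann_id: str      # ID of the normalisation annotation, e.g. "N1"
    target: str      # ID of the entity it is attached to, e.g. "T1"
    resource: str    # resource key before the colon (assumed to be "UMLS")
    concept_id: str  # identifier after the colon, e.g. a CUI
    label: str       # human-readable concept name, may be empty


def parse_normalisations(ann_text: str) -> List[Normalisation]:
    """Collect normalisation (N) lines from brat standoff text."""
    results = []
    for line in ann_text.splitlines():
        if not line.startswith("N"):
            continue  # skip entities, relations, notes, etc.
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed line
        parts = fields[1].split()
        if len(parts) != 3:
            continue  # expected: type, target ID, RESOURCE:ID
        _ann_type, target, reference = parts
        resource, _, concept_id = reference.partition(":")
        label = fields[2] if len(fields) > 2 else ""
        results.append(
            Normalisation(fields[0], target, resource, concept_id, label)
        )
    return results


# Hypothetical sample built from the Figure 1 example; offsets are illustrative.
SAMPLE = (
    "T1\tIndividualBehaviour 0 13\tnever smokers\n"
    "N1\tReference T1 UMLS:C0425293\tNever smoked tobacco\n"
)

for norm in parse_normalisations(SAMPLE):
    print(norm.target, norm.concept_id, norm.label)
```

An entity mapped to more than one UMLS concept would simply yield several entries with the same target ID under this representation.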