Named Entity Annotation

Annotation scheme

The PHAEDRA corpus is annotated with three named entity (NE) types, which provide a basis for linking the effects of drugs with information about the medical subjects in which they occur. Table 1 outlines the scope of each NE type.

Table 1. Types of NEs annotated
NE Type: Pharmacological_substance
Description: Pharmacological substance that may or may not be approved for human use.
Examples:
  Genes/gene products used as therapeutic agents: echistatin
  Generic drug names: didanosine
  IUPAC and IUPAC-like chemical names of drugs: 5-hydroxy-L-tryptophan
  Endogenous substances administered as exogenous drugs: insulin
  Toxins: 1-methyl-4-phenyl-1,2,3,4-tetrahydropyridine
  Excipients: isopropyl myristate
  Generic or chemical names of metabolites: threohydrobupropion
  Drug brand names: DIAMOX
  Names of groups of drugs: fluoroquinolones
  Expressions characterising general classes of drugs: dopamine D1 receptor antagonist

NE Type: Disorder
Description: Observation about a medical subject's body or mind that is considered to be abnormal or caused by a disease, pharmacological substance or DDI.
Examples:
  Medical conditions: pulmonary embolism
  Abnormality in physiological function: hyperlocomotion
  Pathological process: fibrosis
  Neoplastic process: intestinal adenocarcinomas
  Damage caused by disease or drugs: cerebellar damage
  Mental or behavioural issue: drug abuse
  Injury or poisoning: clinical toxicities
  Viruses/bacteria: Micrococcus luteus
  Sign or symptom: nausea
  Abnormality in clinical attributes or measurements: increased urine sodium

NE Type: Subject
Description: An organism, cell line, bacterium or group thereof, whose characteristics are under discussion. The organism may be human or otherwise.
Examples:
  General references to groups of subjects: children
  Names of specific species under discussion: mice
  Names of bacteria under discussion: Klebsiella oxytoca
  Expressions that specify a number of subjects: 16 patients
  Descriptions of subject characteristics: 50-year old male patient

NE mention statistics

Every mention of a concept belonging to one of the types in Table 1 was annotated in every abstract in the corpus. Table 2 shows, for each NE type, the total number of annotated mentions and the number of unique annotated spans.

Table 2. Statistics of NE Mentions
NE Type                     Total number of annotated mentions   Number of unique annotated spans
Pharmacological_substance   8099                                 1853
Subject                     1552                                 712
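
The distinction between total mentions and unique spans in Table 2 can be illustrated with a minimal sketch. The toy annotations and the function name `mention_statistics` are hypothetical, and the sketch assumes that two spans count as the same when their text is identical; the source does not state whether any normalisation (e.g., case folding) was applied.

```python
from collections import defaultdict

# Hypothetical toy annotations: (NE type, annotated span text).
# These are illustrative only and are not taken from the corpus.
mentions = [
    ("Pharmacological_substance", "insulin"),
    ("Pharmacological_substance", "insulin"),
    ("Pharmacological_substance", "didanosine"),
    ("Subject", "mice"),
    ("Subject", "16 patients"),
    ("Subject", "mice"),
]


def mention_statistics(mentions):
    """Return {ne_type: (total mentions, number of unique span strings)}.

    Assumption: uniqueness is decided by exact string identity.
    """
    totals = defaultdict(int)
    unique = defaultdict(set)
    for ne_type, text in mentions:
        totals[ne_type] += 1       # every annotated instance counts
        unique[ne_type].add(text)  # repeated strings collapse to one
    return {t: (totals[t], len(unique[t])) for t in totals}


stats = mention_statistics(mentions)
for ne_type, (total, n_unique) in stats.items():
    print(ne_type, total, n_unique)
```

In the toy data, each type has three mentions but only two unique spans, because one span string is repeated.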


The entity annotations were carried out by annotators with domain expertise. We verified the quality and consistency of the annotations by calculating inter-annotator agreement (IAA) on one quarter of the complete corpus (i.e., 150 abstracts). IAA was calculated as an F-score under two criteria: exact span matching, in which the start and end of the text spans chosen by both annotators must match exactly, and relaxed span matching, in which it is sufficient for the two spans to overlap to some degree. Table 3 shows the resulting F-scores.

Table 3. Inter-annotator agreement rates (F-score)
NE Type                     Relaxed Match   Exact Match
Pharmacological_substance   96.0            92.8
Subject                     81.1            81.1
TOTAL                       92.6            86.0
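
The two matching criteria behind Table 3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' evaluation script: spans are represented as hypothetical `(start, end, ne_type)` tuples with an exclusive end offset, one annotator is treated as the reference when computing precision and recall, and a match additionally requires the same NE type (the source only states the span conditions). The names `f_score`, `exact_match` and `relaxed_match` are introduced here for illustration.

```python
def exact_match(a, b):
    """Spans match only if both start and end offsets are identical."""
    return a[0] == b[0] and a[1] == b[1]


def relaxed_match(a, b):
    """Spans match if they share at least one character (end is exclusive)."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(spans_a, spans_b, match):
    """F-score between two annotators, treating annotator A as the reference.

    Assumption: a pair of spans only counts as a match if the NE types agree.
    """
    if not spans_a or not spans_b:
        return 0.0
    # Annotator B's spans that have a counterpart in A (for precision).
    matched_b = sum(
        any(a[2] == b[2] and match(a, b) for a in spans_a) for b in spans_b
    )
    # Annotator A's spans that have a counterpart in B (for recall).
    matched_a = sum(
        any(a[2] == b[2] and match(a, b) for b in spans_b) for a in spans_a
    )
    precision = matched_b / len(spans_b)
    recall = matched_a / len(spans_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of one abstract by two annotators.
annotator_a = [
    (0, 9, "Pharmacological_substance"),
    (20, 35, "Disorder"),
    (40, 51, "Subject"),
]
annotator_b = [
    (0, 9, "Pharmacological_substance"),   # identical span
    (24, 35, "Disorder"),                  # overlaps, different start
    (60, 64, "Subject"),                   # no overlap with A's Subject span
]

exact_f = f_score(annotator_a, annotator_b, exact_match)
relaxed_f = f_score(annotator_a, annotator_b, relaxed_match)
print(f"exact: {exact_f:.3f}, relaxed: {relaxed_f:.3f}")
```

In this toy case, one of three spans agrees under exact matching and two of three under relaxed matching, so the relaxed F-score is the higher of the two. More generally, a pair of spans that matches exactly also overlaps, so relaxed agreement cannot fall below exact agreement.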