NaCTeM

Time-sensitive inventory of medical terminology

Description

This inventory contains a set of terms that are relevant to the study of medical history.

The inventory is organised as a set of "heading terms" (around 175,000). Each heading term, which belongs to one or more of seven different semantic categories (shown in Table 1), is accompanied by a set of semantically-related terms. These related terms have been automatically extracted using text mining methods from large collections of published medical text, dating from 1840 to the present day.

The nature of the semantic relationship holding between the heading term and each related term varies. Some examples of possible relationships holding between pairs of terms include the following:

  • Related terms may be synonyms of each other. For example, pulmonary tuberculosis, pulmonary phthisis and tuberculous consumption are identified as terms related to pulmonary consumption.
  • One term may be more or less specific than the other. For example, the terms flat, shelter, chapel and hut are identified as terms related to building.
  • One term may correspond to a part of the other. For example, ankle, shin and thigh are identified as terms related to leg.
  • One term may occur (spatially) in the proximity of the other. For example, larynx, pharynx and bronchiole are identified as terms related to trachea.
  • One term may be used in the treatment of the other. For example, the drugs glibenclamide, metformin and tolbutamide are identified as terms related to diabetes.

The unique feature of our terminological inventory is that the semantically-related terms may have been used in different periods of time, and may no longer be in common usage today.

Availability

The inventory is available to download. The format of the inventory is described on a separate page. Please observe the terms and conditions of the licence (see below).

Motivation

Over time, shifts in terminology, advances in medical knowledge, changes in living conditions, etc., mean that the terms closely related to a given concept are likely to change. Accordingly, in studying historical change, it can often be difficult to know which query terms to use in a search system to ensure that relevant documents are retrieved from different periods of time.

It can be extremely useful if the system is able to suggest terms that are related to the query terms and that help to widen the scope of the search. Such suggestions may help users to retrieve documents from a wider time range and/or to explore areas related to their original concepts within a particular time period. The inventory has been used to provide such functionality in the History of Medicine search system.

As an example, within our inventory, the heading term rubella (a viral infection formerly common in children) includes amongst its semantically related terms a historically-relevant synonym (Rotheln), as well as other viruses and viral infections (e.g., smallpox, chickenpox, rotavirus, poliovirus), some of which are also particularly common in childhood.

As a further example, the environmental factor overcrowding has related terms corresponding to other poor living conditions that may have contributed to certain diseases in the past (dilapidation, ventilation, cleanliness, etc.), together with other environmental entities representing the structures in which such conditions occur (e.g., house and its more temporally-sensitive synonym dwelling, workroom, etc.).

Inventory creation

The inventory was created automatically by applying a text mining technique called distributional semantics to two large collections of published medical documents, each spanning a long period of time. These collections are as follows:

  • The archive of the British Medical Journal (BMJ). This journal is aimed at medical professionals, and includes various types of articles, including research, analysis, practice, case reports, letters, and obituaries. We worked with a collection of approximately 380,000 articles, spanning from 1840 to 2013.
  • The London Medical Officer of Health (MOH) reports. These reports are concerned with examining public health issues in different London boroughs. The archive consists of around 5,000 reports produced between 1848 and 1972, whose lengths range from a few pages to several hundred pages.

Distributional semantic models (DSMs) exploit the observation that words that appear in similar textual contexts (e.g., which are preceded/followed by similar sets/sequences of words) often exhibit similar meanings.
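The idea can be illustrated with a minimal Python sketch. It is not the model that was used to build the inventory (the description here does not specify one). It assumes a simple count-based DSM in which each word is represented by counts of the words found within a fixed window around it, and two words are compared by the cosine of the angle between their count vectors. The function names, the window size and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (invented): "phthisis" and "consumption" share contexts, "drains" does not.
toy_corpus = [
    "patient died of phthisis after long illness",
    "patient died of consumption after long illness",
    "the drains were found blocked and foul",
]
vectors = context_vectors(toy_corpus)
similar = cosine(vectors["phthisis"], vectors["consumption"])
dissimilar = cosine(vectors["phthisis"], vectors["drains"])
```

In this toy corpus the two disease terms have identical neighbouring words and so receive a high similarity, whereas the disease term and the infrastructure term share no neighbouring words and receive a similarity of zero.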

As our starting point, we used as heading terms a set of terms that had been automatically recognised in the above collections through the application of a specialised named entity (NE) recognition tool. NE recognition tools automatically locate terms corresponding to entities of interest in texts, and assign appropriate semantic categories to them. We applied an NE tool that was developed in the context of the same project to recognise entities of historical and/or medical significance in medical documents from a range of historical periods. The semantic categories of the terms recognised by our NE recognition tool are detailed in Table 1.

Table 1. NE types included in the terminological inventory
Entity Type | Description | Examples
Condition | Medical condition/ailment | phthisis, bronchitis, typhus
Sign_or_Symptom | Altered physical appearance/behaviour as probable result of injury/condition | cough, pain, rise in temperature, swollen
Anatomical | Entity forming part of human body, including substances and abnormal alterations to bodily structures | lung, lobe, sputum, fibroid
Subject | Individual or group under discussion | children, asthma patients, those with negative reactions to tuberculin
Therapeutic_or_Investigational | Treatment/intervention administered to combat condition (including diet/foodstuffs), or substance, medium or procedure used in investigational medical or public health context | atrophine sulphate, generous diet, change of air, lobectomy
Biological Entity | Living entity not part of human body, including microorganisms, animals and insects | tubercle bacilli, mould, guinea-pig, flea
Environmental | Environmental factor relevant to incidence/prevention/control/treatment of condition. Includes climatic conditions, foodstuffs, infrastructure, household items or occupations whose environmental factors are mentioned | humidity, high mountain climates, infected milk, linen, drains, sewers, dusty occupations

The complete document collections were then further processed to determine the various textual contexts in which each heading term can appear. Other terms with similar contexts were then found and listed in the inventory as related terms of the heading term. By collecting contextual information from the complete, long-spanning text collections, we try to ensure that related terms that may be relevant at different periods of time are included in the inventory.

In the final inventory, each heading term is accompanied by the following information:

  • The set of the 20 terms that are considered to be most closely related to the heading term, according to contextual similarity.
  • Each of the 20 most related terms is accompanied by a numerical score that represents the degree of similarity between the contexts of the related term and the contexts of the heading term. This score is called the cosine similarity.
  • The semantic category (amongst those listed in Table 1) that was assigned to the heading term by the NE recognition tool.
  • Links to one or more concepts in the UMLS Metathesaurus. This is another large-scale resource that includes biomedical and health related concepts, which has been largely created manually. In the Metathesaurus, concepts are represented by a set of synonymous terms, and different types of relationships are identified between concepts. In our terminological inventory, we include the identifiers of all Metathesaurus concepts in which our heading term figures within their list of synonyms. Whilst there is a large degree of overlap between the scopes of the UMLS Metathesaurus and our terminological inventory, it is expected that the information present within each resource can complement the information present within the other. Specifically, the UMLS Metathesaurus does not aim to include comprehensive coverage of historical term variants, whereas the identification of historical related terms has been a major aim of our work. Thus, the inclusion of potentially related UMLS concept identifiers within our inventory provides future scope for linking together the two resources.
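To make the shape of an entry concrete, the following Python sketch models one heading term and selects its most closely related terms from a set of candidate scores. The class and field names, the candidate scores and the empty list of UMLS identifiers are assumptions for illustration only. This is not the inventory's actual file format, and no real scores or identifiers are reproduced.

```python
from dataclasses import dataclass, field
import heapq


@dataclass
class InventoryEntry:
    """Hypothetical in-memory view of one heading term (not the actual file format)."""
    heading_term: str
    category: str  # one of the Table 1 entity types
    related_terms: list = field(default_factory=list)  # (term, cosine similarity) pairs
    umls_concept_ids: list = field(default_factory=list)


def top_related(candidate_scores, k=20):
    """Return the k highest-scoring (term, score) pairs, best first."""
    return heapq.nlargest(k, candidate_scores.items(), key=lambda pair: pair[1])


# Invented scores, for illustration only.
candidates = {
    "pulmonary tuberculosis": 0.81,
    "pulmonary phthisis": 0.77,
    "tuberculous consumption": 0.69,
    "drains": 0.05,
}
entry = InventoryEntry(
    heading_term="pulmonary consumption",
    category="Condition",
    related_terms=top_related(candidates, k=3),
)
```

With k left at its default of 20, the same helper would return the 20 most closely related terms, as stored for each heading term in the inventory.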

Inventory usage

The time-sensitive inventory of medical terminology is primarily intended to be used in search systems over historical medical archives, as a means to help users to widen the scope of their search, both in terms of the range/depth of topics explored, and in order to allow the retrieval of relevant documents over a wide time period.

When a term is searched for by a user, it can be looked up as a heading term in the terminological inventory, and the 20 most related terms listed can be suggested as possible ways to expand the query, as is done in the History of Medicine semantic search system. The inclusion of the contextual similarity scores for each related term also allows the possibility of filtering the complete list of related terms according to some threshold, such that less related terms are excluded. Furthermore, linking with the UMLS Metathesaurus could help to identify further potentially related terms.
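The lookup and threshold filtering described above might be sketched in Python as follows. The function name, the dictionary layout and the similarity scores are hypothetical and invented for illustration; they do not reflect the implementation of the History of Medicine search system.

```python
def expand_query(query, inventory, min_score=0.0, max_terms=20):
    """Suggest related terms for `query`, keeping those scoring at least `min_score`."""
    related = inventory.get(query.lower().strip(), [])
    kept = [(term, score) for term, score in related if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in kept[:max_terms]]


# Hypothetical lookup table: heading term -> (related term, similarity) pairs.
# The scores are invented for illustration.
toy_inventory = {
    "rubella": [
        ("rotheln", 0.72),
        ("chickenpox", 0.64),
        ("smallpox", 0.58),
        ("poliovirus", 0.41),
    ],
}

suggestions = expand_query("Rubella", toy_inventory, min_score=0.5)
```

Raising min_score narrows the suggestions to the most closely related terms, while a query that is not a heading term simply yields no suggestions.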

The screenshots below show how the terminological inventory is used in the History of Medicine semantic search system.

When a medically-relevant search term is entered by the user, a set of related terms is displayed, by accessing the time-sensitive terminological inventory. This is illustrated in Figure 1. The differing sizes of the related terms provide an indication of their level of contextual similarity to the query term entered, found by accessing the scores associated with the related terms. In the interface, clicking on a term causes it to be added to the query.

Fig 1. - Display of related terms in the HOM interface.
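One plausible way to derive display sizes from similarity scores is linear scaling between a minimum and a maximum font size. The Python sketch below is an assumption about how such a display could be produced, not a description of the HOM interface code; the function name, the point sizes and the scores are invented.

```python
def font_sizes(scored_terms, min_pt=10.0, max_pt=24.0):
    """Linearly map each term's score onto the range [min_pt, max_pt]."""
    if not scored_terms:
        return {}
    scores = [score for _, score in scored_terms]
    low, high = min(scores), max(scores)
    span = high - low
    sizes = {}
    for term, score in scored_terms:
        # If all scores are equal, fall back to the midpoint size.
        fraction = (score - low) / span if span > 0 else 0.5
        sizes[term] = min_pt + fraction * (max_pt - min_pt)
    return sizes


# Invented scores, for illustration only.
sizes = font_sizes(
    [("pulmonary phthisis", 0.8), ("consumption", 0.6), ("bronchitis", 0.4)]
)
```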

Figure 2 shows how the frequencies of occurrence of the terms in the document archives over time are used as a guide to indicate the time-sensitive nature of the terms. In this example, the user has widened their search by choosing the term pulmonary phthisis as a related term of pulmonary tuberculosis. The graph shows that pulmonary phthisis, whilst initially more common than pulmonary tuberculosis, largely fell out of use after the 1930s.

Fig 2. - Term usage over time in the HOM interface.
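Counts of this kind could be gathered along the following lines. This Python sketch assumes that documents are available as (year, text) pairs and groups occurrences by decade; the function name, the grouping by decade and the toy documents are assumptions, and the simple substring matching is a simplification of whatever term recognition the real system uses.

```python
from collections import Counter


def mentions_per_decade(documents, term):
    """Count occurrences of `term` in (year, text) pairs, grouped by decade."""
    counts = Counter()
    needle = term.lower()
    for year, text in documents:
        decade = (year // 10) * 10
        counts[decade] += text.lower().count(needle)
    return dict(sorted(counts.items()))


# Invented (year, text) pairs standing in for archive documents.
toy_documents = [
    (1885, "A case of pulmonary phthisis. Pulmonary phthisis was diagnosed late."),
    (1889, "Notes on pulmonary phthisis in the borough."),
    (1952, "Treatment of pulmonary tuberculosis with streptomycin."),
]
phthisis_counts = mentions_per_decade(toy_documents, "pulmonary phthisis")
tuberculosis_counts = mentions_per_decade(toy_documents, "pulmonary tuberculosis")
```

Plotting the two resulting series against the decade would give a graph comparable in spirit to the one in the interface, in which one term declines as the other rises.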

Evaluation

An evaluation of the performance of our method in terms of its ability to recognise related terms of disease entities showed that it was able to recognise synonyms that are not listed in the UMLS Metathesaurus, and that it was able to recognise terms semantically related to the heading term in a variety of ways. Evaluation by a medical history expert revealed that the majority (62%) of automatically identified terms were deemed to be semantically related to the relevant heading term.

Acknowledgements

The terminological inventory was created as part of the AHRC-funded "Mining the History of Medicine" project (Grant No: AH/L00982X/1). We would like to thank the BMJ for granting access to their archive of articles, and the Wellcome Trust, for consenting to the use of the MOH reports.

Time-sensitive medical terminology licence

The time-sensitive inventory was created at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. It is licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the inventory, and please cite the following article:

Paul Thompson, Riza Theresa Batista-Navarro, Georgios Kontonatsios, Jacob Carter, Elizabeth Toon, John McNaught, Carsten Timmermann, Michael Worboys and Sophia Ananiadou (2016). Text Mining the History of Medicine. PLOS ONE, 11(1): e0144717.