Time-sensitive medical inventory format

The inventory is made available as a set of seven files, each containing heading terms belonging to one of the seven categories shown in Table 1.

The files are named in the following way:

all_nes_bmj_moh_normalised.thesaurus_[CAT].obo
where [CAT] corresponds to one of the semantic categories.
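
As a minimal sketch, the seven filenames can be generated from the category names. It assumes that [CAT] takes exactly the values listed for the subset field later in this document, which is not confirmed here; the helper name inventory_filenames is hypothetical.

```python
# Assumption: [CAT] uses the same spellings as the "subset" values
# listed later in this document (copied verbatim from that list).
CATEGORIES = [
    "anatomical_entity",
    "biological_entity",
    "condition",
    "enviromental_entity",
    "subject",
    "sign_or_symptom",
    "therapeutic_or_investigation_entity",
]

FILENAME_TEMPLATE = "all_nes_bmj_moh_normalised.thesaurus_{cat}.obo"


def inventory_filenames(categories=CATEGORIES):
    """Return a dict mapping each category to its expected filename."""
    return {cat: FILENAME_TEMPLATE.format(cat=cat) for cat in categories}
```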

Note that a heading term and its related terms may occur in more than one file, since a heading term may correspond to entities belonging to different semantic categories (e.g. if the heading term has more than one sense).
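
One way to find heading terms that occur in more than one file is to group the categories by heading term. The sketch below works on (category, heading term) pairs that have already been read from the files; the function names are hypothetical, and the pairs in the usage lines are made-up illustrations rather than entries taken from the inventory.

```python
from collections import defaultdict


def categories_by_heading(pairs):
    """Group categories by heading term.

    `pairs` is an iterable of (category, heading_term) tuples.
    """
    index = defaultdict(set)
    for category, heading in pairs:
        index[heading].add(category)
    return index


def ambiguous_headings(pairs):
    """Return heading terms that appear under more than one category."""
    index = categories_by_heading(pairs)
    return {h: sorted(cats) for h, cats in index.items() if len(cats) > 1}


# Illustrative input only: these pairs are invented for the example.
example_pairs = [
    ("condition", "cold"),
    ("enviromental_entity", "cold"),
    ("condition", "pulmonary tuberculosis"),
]
shared = ambiguous_headings(example_pairs)
```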

The files are made available in the format used by the Open Biological and Biomedical Ontologies (OBO).

This format is widely used to encode ontologies covering medical and biomedical subdomains, including diseases, anatomical entities and environmental entities. Releasing the inventory in this format has two main advantages. Firstly, using a standardised format opens up possibilities for future integration of the inventory with other ontologies covering relevant areas. Secondly, the files can be processed and visualised with existing tools, such as Protégé and OBO-Edit.

NOTE: Because some of the terminology files are very large, the default memory allocation for these tools must be increased. We recommend Protégé for visualising the files, since it seems better able to handle large files, especially when the memory allocation flags in the run.command file are set to -Xmx6G -Xms2G. OBO-Edit also seems able to load the larger files (especially when the OBOEdit.vmoptions file is edited to allocate more memory; we tried -Xmx6000M), but it seems less responsive than Protégé once a large file has been loaded.

Note, however, that the files are intended mainly for computational processing rather than for human viewing.

An example entry in the file (for the term pulmonary tuberculosis) is shown below.

[Term]
id: HOM:1159
name: pulmonary tuberculosis
synonym: "non pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.9285235631916715]
synonym: "miliary pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8890370477480192]
synonym: "tuberculosis pulmonary" RELATED DS_RELATED [DS_SCORE:0.8694139206266057]
synonym: "chronic pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8654668011030042]
synonym: "pulmonary y tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8619417498619306]
synonym: "pulmonary phthisis" RELATED DS_RELATED [DS_SCORE:0.8421124650540113]
synonym: "inactive pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.832065670929346]
synonym: "old pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8307341025365899]
synonym: "primary pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8298542706890558]
synonym: "bilateral pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8191471054604339]
synonym: "tuberculosis miliary" RELATED DS_RELATED [DS_SCORE:0.8123758775790052]
synonym: "incipient pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8069857568206482]
synonym: "pulmonary mycobacterium tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8068339428878909]
synonym: "pulmonary tuberculosis active" RELATED DS_RELATED [DS_SCORE:0.8064070676210551]
synonym: "infectious pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.80465669670636]
synonym: "extrapulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8028301888191761]
synonym: "acute miliary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.7976048542827227]
synonym: "symptomless pulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.7963059836059512]
synonym: "chronic miliary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.795678270460759]
synonym: "nonpulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.7922752141018254]
subset: condition
xref: UMLS:C0041327

The general format of OBO files is fully documented elsewhere, but here we explain some of the features of the OBO file encoding the time-sensitive medical terminological inventory.

Each entry includes the following parts:

  • [Term] - The first line of an entry.
  • id - provides a unique id for the term
  • name - the "heading term" for this entry
  • synonym - denotes one of the 20 most similar semantically related terms for the heading term, according to distributional semantics. Note that the semantically related term is not necessarily a synonym, but the keyword is required to comply with the OBO format; a more exact characterisation of the nature of the relationship is provided later in the line, as detailed below. Following the synonym: tag, there are four further parts to each line, separated by spaces:
    1. The related term, enclosed in quotes
    2. RELATED - One of a fixed set of synonym scope values possible in the OBO format, denoting that there is a semantic relation of a general nature between the heading term and the semantically-related term. This is used in all cases, since the current process for extracting terms does not attempt to distinguish between different types of semantic relations.
    3. DS_RELATED - denotes that the related term has been extracted using distributional semantics techniques.
    4. DS_SCORE - A score for the related term, enclosed in square brackets. This score (the cosine similarity) represents the level of similarity between the textual contexts of the heading term and those of the related term.
  • subset - the semantic category assigned to the term by the NE recogniser. Possible values are as follows:
    • anatomical_entity
    • biological_entity
    • condition
    • enviromental_entity
    • subject
    • sign_or_symptom
    • therapeutic_or_investigation_entity
  • xref - zero or more lines with an identifier (beginning with a "C"), corresponding to a concept identifier in the UMLS Metathesaurus.
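
Since the files are intended for computational processing, the entry structure described above can be read with a few lines of code. The following is a minimal sketch, not an official reader: it assumes every synonym line follows exactly the four-part layout shown in the example entry, it ignores any other tags and stanza types that the full OBO format allows, and the names parse_terms and SYNONYM_RE are hypothetical.

```python
import re

# Matches lines such as:
#   synonym: "pulmonary phthisis" RELATED DS_RELATED [DS_SCORE:0.84]
SYNONYM_RE = re.compile(
    r'^synonym:\s+"(?P<term>[^"]*)"\s+(?P<scope>\S+)\s+(?P<type>\S+)'
    r'\s+\[DS_SCORE:(?P<score>[^\]]+)\]\s*$'
)


def parse_terms(lines):
    """Yield one dict per [Term] stanza found in an iterable of lines."""
    entry = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza begins: emit the previous [Term], if any.
            if entry is not None:
                yield entry
            entry = None
            if line == "[Term]":
                entry = {"id": None, "name": None, "related": [],
                         "subset": None, "xrefs": []}
            continue
        if entry is None or not line:
            continue  # header lines, blank lines, non-Term stanzas
        if line.startswith("synonym:"):
            match = SYNONYM_RE.match(line)
            if match:
                entry["related"].append(
                    (match.group("term"), float(match.group("score")))
                )
            continue
        # Split on the first colon only, so "id: HOM:1159" keeps its value.
        tag, sep, value = line.partition(":")
        value = value.strip()
        if not sep:
            continue
        if tag in ("id", "name", "subset"):
            entry[tag] = value
        elif tag == "xref":
            entry["xrefs"].append(value)
    if entry is not None:
        yield entry


sample = """[Term]
id: HOM:1159
name: pulmonary tuberculosis
synonym: "pulmonary phthisis" RELATED DS_RELATED [DS_SCORE:0.8421124650540113]
synonym: "extrapulmonary tuberculosis" RELATED DS_RELATED [DS_SCORE:0.8028301888191761]
subset: condition
xref: UMLS:C0041327
""".splitlines()

entries = list(parse_terms(sample))
```

Because parse_terms accepts any iterable of lines, an open file handle can be passed in directly, which avoids reading a large file into memory at once.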