NaCTeM

Manchester Molecular Pathology Innovation Centre (MMPathIC): bridging the gap between biomarker discovery and health and wealth

Background

Stratified medicine (which is allied to personalised or precision medicine) is an approach to treating patients by categorising them into groups based on their risk of developing a particular disease, or on how they are likely to respond to a particular drug or therapy.

It is key that the correct tests and techniques are available to put individuals into groups (stratify patients), depending on their exact disease type and likely response to particular treatments. One way in which this might be possible is through molecular pathology, a specific type of pathology (the study of disease) focused on the diagnosis and repeated characterisation of disease through the examination of molecules within organs, tissues or bodily fluids, such as blood, urine or synovial fluid (the fluid found in joints). Differences in the proteins found in such samples, e.g., between healthy people and people with a specific disease, may prove useful as biomarker tests that can be used to diagnose a disease. In addition, by examining the differences in the levels of particular marker proteins between patients who respond to a drug and those who do not, doctors will be able to identify which drug is the best treatment for specific patients.

Aims

The aim of the Manchester molecular pathology node (Manchester Molecular Pathology Innovation Centre, MMPathIC) is to create an environment which enables new biomarker tests, based on molecular pathology techniques, to be developed. These can then be used to stratify patients, to allow more accurate diagnosis or prediction of the best treatments to use. The initial focus will be on people who suffer from inflammatory disease (psoriasis, rheumatoid arthritis and lupus), given the availability of a large number of patient samples for these diseases. It is planned to produce at least 6 new tests which are ready to be commercialised, or ready to be used in hospital pathology laboratories, within the first 3 years of the grant.

MMPathIC will combine the skills of experts working in several areas. Medical expertise will be complemented by the skills of researchers working in other areas, e.g., information specialists, to allow the data produced to be linked to genomics data; health economists, to allow informed decisions to be made by NHS officials; and text miners.

Text Mining Workstrand

Text mining (TM) will be employed to carry out automated semantic analysis of various "unstructured" textual information sources that may contain information relevant to the development of biomarker tests, including biomedical literature and electronic health records. Given that each of these sources comprises vast numbers of documents, information contained within them may be hidden and easily overlooked. TM techniques will be used in a number of ways to enhance the ease and efficiency with which unstructured textual information sources can be exploited to support the development of biomarker tests. For example:

  • To locate, structure and link together different types of information about potential biomarkers which may be dispersed across documents of different types, e.g., genetic variations that can be indicative of a particular disease, in which types of patients such variations occur, what type of changes occur in response to drugs, etc.
  • To allow the discovery of potentially unknown associations (e.g., between proteins and diseases), which could act as a stimulus for investigating novel biomarkers.

Text Mining Outcomes

Among the TM-related outcomes are the following:

Novel Methods

A number of methods have been developed to aid in automatically finding and interpreting important information that is specified in text. These methods can be applied to both formal academic articles and clinical records:

  • Detecting mentions of concepts in text that are relevant to biomarkers, including medical problems, symptoms, drugs, genes, proteins, treatments, tests, phenotypic information and medical subjects.
  • Mapping (or normalising) these mentions to unique concepts in domain-specific databases, given that there are often many ways in which a single concept can be mentioned in text.
  • Determining the level of risk associated with different types of concepts.
  • Detecting relations between concepts mentioned in text, such as associations between genes and diseases, or between drugs and medical problems. Novel approaches have been developed to find relations both within and across sentences, as well as across different documents.
  • Determining how extracted relations should be interpreted, e.g., whether they constitute hypotheses, well-known knowledge or new knowledge that results from research carried out in a paper.
  • Methods to aid in systematic reviews of the literature.
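
As a rough illustration of the first, second and fourth steps listed above (detecting mentions, normalising them and finding relations within a sentence), the following minimal sketch uses a small dictionary lookup and sentence-level co-occurrence. It is not the method developed in the project, which relies on more sophisticated approaches. The lexicon, the function names and the concept identifiers are all hypothetical and invented for this example.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon: surface form -> (concept type, placeholder ID).
# The IDs are invented for illustration and do not come from a real database.
LEXICON = {
    "tnf": ("gene", "GENE:0001"),
    "tumour necrosis factor": ("gene", "GENE:0001"),
    "rheumatoid arthritis": ("disease", "DIS:0001"),
    "ra": ("disease", "DIS:0001"),
    "methotrexate": ("drug", "DRUG:0001"),
}

# Longest surface forms first, so that longer names win over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_mentions(sentence):
    """Return (start, end, text, type, concept_id) for each lexicon match."""
    mentions = []
    for m in _PATTERN.finditer(sentence):
        ctype, cid = LEXICON[m.group(1).lower()]
        mentions.append((m.start(), m.end(), m.group(1), ctype, cid))
    return mentions


def cooccurring_pairs(text):
    """Pair up distinct normalised concepts that share a sentence."""
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        ids = sorted({m[4] for m in find_mentions(sentence)})
        pairs.update(combinations(ids, 2))
    return pairs


example = "TNF is elevated in rheumatoid arthritis. Methotrexate is used to treat RA."
candidate_relations = cooccurring_pairs(example)
```

Because "rheumatoid arthritis" and "RA" map to the same placeholder identifier, both sentences contribute pairs involving the same disease concept, which is the point of the normalisation step. Co-occurrence only suggests candidate relations; it does not establish that a relation holds or how it should be interpreted.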

Annotated corpora

We have developed and released two annotated corpora as part of the project. Annotated corpora are collections of documents in which domain experts have manually marked-up various types of pertinent information. Annotated corpora are vital to support the development of text mining tools, such as those for extracting concept mentions and relationships between them. By making them available to the text mining community, we can stimulate the development of increasingly sophisticated text mining methods. The corpora that we have developed are the following:

  • COPD corpus - a semantically annotated corpus, focussed on phenotypic information, consisting of 30 full-text articles. The corpus has been manually annotated with concept mentions, using a fine-grained annotation scheme, which aims to capture detailed information about COPD phenotypes. In particular, the annotations may be "nested" within each other. This is to take into account the potentially complex and nested nature of phenotype descriptions. The types of information annotated include problems, treatments, tests, drugs and genes/proteins.
  • PHAEDRA corpus - a semantically annotated corpus for pharmacovigilance (PV), consisting of 597 MEDLINE abstracts. Its fine-grained, multiple levels of annotation, added by domain experts, make it a unique resource within the field, and aim to encourage the development/adaptation of novel machine learning tools for extracting PV-related information from text. Levels of annotation include concept mentions, relations between them and interpretative information.
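
To make the idea of "nested" annotations concrete, the sketch below represents annotations as character-offset spans and finds those that lie inside another span. The class, the function, the example phrase and its labels are all hypothetical; they are not taken from either corpus or its annotation scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # character offset of the first character
    end: int    # character offset one past the last character
    label: str  # concept type, e.g. "problem" or "test"


def nested_pairs(annotations):
    """Return (outer, inner) pairs where inner lies inside a longer outer span."""
    pairs = []
    for outer in annotations:
        for inner in annotations:
            if inner is outer:
                continue
            contains = outer.start <= inner.start and inner.end <= outer.end
            shorter = (inner.end - inner.start) < (outer.end - outer.start)
            if contains and shorter:
                pairs.append((outer, inner))
    return pairs


# Invented example: a longer span with a shorter span nested inside it.
text = "reduced lung function"
annotations = [
    Annotation(0, 21, "problem"),  # "reduced lung function"
    Annotation(8, 21, "test"),     # "lung function"
]
nesting = nested_pairs(annotations)
```

Storing offsets rather than copied strings keeps each annotation tied to its exact position in the document, so overlapping and nested spans can coexist without ambiguity.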

User interfaces

Several of the methods above have been combined into user interfaces that make it easier to make use of information contained in large volumes of text for semantic search, browsing and discovery of hidden knowledge. A number of interfaces have been developed and/or updated within the context of the MMPathIC project:

  • Thalia - (Text mining for Highlighting, Aggregating and Linking Information in Articles) - a semantic search engine that can recognise concepts occurring in biomedical abstracts indexed on PubMed. It currently recognises eight types of concepts, namely: chemicals, diseases, drugs, genes, metabolites, proteins, species and anatomical entities.
  • FACTA+ - a MEDLINE search engine for finding associations between biomedical concepts.
  • RobotAnalyst - A tool to minimise the human workload involved in the study identification phase of systematic reviews.

Publications

Christopoulou, F., Tran, T. T., Sahu, S. K., Miwa, M. and Ananiadou, S. (2019). Adverse Drug Events and Medication Relation Extraction in EHRs with Ensemble Deep Learning Methods. Journal of the American Medical Informatics Association, ocz101

Ju, M., Nguyen, N. T. H., Miwa, M. and Ananiadou, S. (2019). An Ensemble of Neural Models for Nested Adverse Drug Events and Medication Extraction with Subwords. Journal of the American Medical Informatics Association, ocz075

Wang, Y., Fan, X., Chen, L., Chang, E. I-C., Ananiadou, S., Tsujii, J. and Xu, Y. (2019). Mapping anatomical related entities to human body parts based on wikipedia in discharge summaries. BMC Bioinformatics, 20:430.

Ju, M., Short, A. D., Thompson, P., Bakerly, N. D., Gkoutos, G., Tsaprouni, L. and Ananiadou, S. (2019). Annotating and Detecting Phenotypic Information for Chronic Obstructive Pulmonary Disease. JAMIA Open, 2(2), 261-271

Bannach-Brown, A., Przybyła, P., Thomas, J., Rice, A., Ananiadou, S., Liao, J. and Macleod, M. R. (2019). Machine learning algorithms for systematic review: reducing workload in a preclinical review of animal studies and reducing human screening error. Systematic Reviews, 8:23

Przybyła, P., Brockmeier, A. J. and Ananiadou, S. (2019). Quantifying Risk Factors in Medical Reports with a Context-Aware Linear Model. Journal of the American Medical Informatics Association, 26(6), 537-546.

Soto, A., Przybyła, P. and Ananiadou, S. (2018). Thalia: Semantic search engine for biomedical abstracts. Bioinformatics

Thompson, P., Daikou, S., Ueno, K., Batista-Navarro, R., Tsujii, J. and Ananiadou, S. (2018). Annotation and Detection of Drug Effects in Text for Pharmacovigilance. Journal of Cheminformatics, 10:37.

Thompson, P. and Ananiadou, S. (2018). HYPHEN: A flexible, hybrid method to map phenotype concept mentions to terminological resources. Terminology, 24(1), 91-121.

Shardlow, M., Batista-Navarro, R., Thompson, P., Nawaz, R., McNaught, J. and Ananiadou, S. (2018). Identification of Research Hypotheses and New Knowledge from Scientific Literature. BMC Medical Informatics and Decision Making, 18:46.

Thompson, P. and Ananiadou, S. (2017). Extracting Gene-Disease Relations from Text to Support Biomarker Discovery. In Proceedings of the 7th International Conference on Digital Health, pp. 180-189

Thompson, P., Boylan, K., Freemont, A. and Ananiadou, S. (2017). Supporting biomarker discovery using text mining. In Proceedings of Informatics for Health 2017.

Project Team

Principal Investigator: Professor Anthony Freemont (Institute of Inflammation and Repair, Faculty of Medicine and Human Sciences)

Co-Investigators:
Sophia Ananiadou (School of Computer Science, The University of Manchester)

Professor Anne Barton (Centre for Musculoskeletal Research Arthritis Research UK Epidemiology Unit, The University of Manchester and Central Manchester Foundation Trust)

Professor Graeme Black (Manchester Centre for Genomic Medicine, Central Manchester University Hospitals NHS Foundation Trust)

Professor Ian Bruce (Institute of Inflammation and Repair, The University of Manchester; The Kellgren Centre for Rheumatology, Central Manchester and Manchester Children's University Hospitals Trust)

Professor Iain Buchan (MRC Health eResearch Centre/ Farr Institute for Health Informatics Research, The University of Manchester; Northwest e-Health, Salford Royal Foundation NHS Trust)

Dr Richard Byers (Institute of Cancer Sciences, The University of Manchester; Manchester Royal Infirmary, Central Manchester University Hospitals NHS Foundation Trust)

Professor Caroline Dive (Cancer Research UK Manchester Institute, The University of Manchester)

Professor Royston Goodacre (School of Chemistry, The University of Manchester)

Professor Katherine Payne (Manchester Centre for Health Economics, The University of Manchester)

Professor John Radford (Institute of Cancer Sciences, The University of Manchester; Christie NHS Foundation Trust)

Professor Anthony Whetton (Institute of Cancer Sciences, The University of Manchester; MRC Clinical Proteomics Centre)

Research Fellows:
Mr. Paul Thompson (NaCTeM)
Dr. Alexander Thompson (Health Economics)
Dr. Nophar Geifman (Health and Biomedical Informatics)

Funding

This project, which runs from October 2015 until September 2019, is being funded by the MRC and EPSRC.