NaCTeM

Text Mining Research Projects

In addition to the development of text mining services and software tools, members of the National Centre for Text Mining are involved in a variety of other projects which influence and contribute to the work of the centre:

Current Projects

Argo

The objective of the Argo project is to develop a workbench for analysing (primarily annotating) textual data. The workbench, which is accessed as a web application, supports the combination of elementary text-processing components to form comprehensive processing workflows. It provides functionality to manually intervene in the otherwise automatic process of annotation by correcting or creating new annotations, and facilitates user collaboration by providing sharing capabilities for user-owned resources. Argo benefits users such as text-analysis designers by providing an integrated environment for the development of processing workflows; annotators/curators by providing manual annotation functionalities supported by automatic pre-processing and post-processing; and developers by providing a workbench for testing and evaluating text analytics.

Automated screening for systematic reviews

This project aims to develop new text mining methods to assist with screening in systematic reviews and health technology assessments. The new methods developed will include automated screening prioritisation, such that studies at the top of the list are those that are most likely to be relevant for manual screening, and automatic classification of documents, according to whether they should be included or excluded from the screening process. The new methods aim to reduce the burden of screening in reviews, to allow reviews to be completed more quickly, to minimise the impact of publication bias and to reduce the chances that relevant research will be missed.

Big Mechanism

Big mechanisms are large, explanatory models of complicated systems in which interactions have important causal effects. Whilst the collection of big data is increasingly automated, the creation of big mechanisms remains a largely human effort, which is being made increasingly challenging by the fragmentation and distribution of knowledge. The ability to automate the construction of big mechanisms could have a major impact on scientific research. As one of a number of different projects that make up the big mechanism programme, our aim is to assemble an overarching big mechanism from the literature and prior experiments and to utilise this for the probabilistic interpretation of new patient panomics data. We will integrate machine reading of the cancer literature with probabilistic reasoning across cancer claims using specially-designed ontologies, computational modelling of cancer mechanisms (pathways), automated hypothesis generation to extend knowledge of the mechanisms and a 'Robot Scientist' that performs experiments to test the hypotheses. A repetitive cycle of text mining, modelling, experimental testing, and worldview updating is intended to lead to increased knowledge about cancer mechanisms.

COPIOUS

This project aims to produce a knowledge repository of Philippine biodiversity by combining the domain-relevant expertise and resources of Philippine partners with the text mining-based big data analytics of the University of Manchester's National Centre for Text Mining. The repository will be a synergy of different types of information, e.g., taxonomic, occurrence, ecological, biomolecular, biochemical, thus providing users with a comprehensive view on species of interest that will allow them to (1) carry out predictive analysis on species distributions, and (2) investigate potential medicinal applications of natural products derived from Philippine species.

EMPATHY

The EMPATHY project aims to support metabolic pathway model curation through the integration of text mining methodologies into a pathway reconstruction platform. Specifically, we set out to accomplish the following: creation of a web-based platform that will allow users to develop their reconstructions using a graphical, user-interactive interface; development of advanced text mining (TM) methods for extracting information on metabolic reactions from literature; integration of TM methods into the reconstruction platform to facilitate the automatic provision of literature-based evidence and revision suggestions to the user; development of an active learning-like mechanism that iteratively captures a user's feedback on text-mined evidence/suggestions and recalibrates the underlying tools in order to produce improved results.

Europe PMC

This is a collaboration with the Text-Mining group at the European Bioinformatics Institute (EBI) and MIMAS, forming a work package in the Europe PMC project (formerly UKPMC) hosted and coordinated by the British Library. Europe PMC, as a whole, forms a European version of the PubMed Central paper repository, in collaboration with the National Institutes of Health (NIH) in the United States. Europe PMC is funded by a consortium of key biomedical research funders. Our contribution to this major project is in the application of text mining solutions to enhance information retrieval and knowledge discovery. As such, this is an application of technology developed in other NaCTeM projects on a large scale and in a prominent resource for the biomedicine community.

Mining Biodiversity

The Mining Biodiversity project aims to transform the Biodiversity Heritage Library (BHL) into a next-generation social digital library resource to facilitate the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community and to raise awareness of the changes in biodiversity over time in the general public. The project integrates novel text mining methods, visualisation, crowdsourcing and social media into the BHL. The resulting digital resource will provide fully interlinked and indexed access to the full content of BHL library documents, via semantically enhanced and interactive browsing and searching capabilities, allowing users to locate precisely the information of interest to them in an easy and efficient manner.

The project will apply text mining methods to add semantic metadata to two digitised medical textual resources with archives dating back to the 1840s, i.e. the British Medical Journal (BMJ) and London-area Medical Officer of Health (MOH) reports. Major outcomes of the project will be a novel temporal terminological resource, which will identify and record terminological variation and semantic shift over time, and a new semantic search system over the enriched archives, which will help historians in broadening and deepening their work to ask 'big' questions that cover long periods, without losing sensitivity to changes in terminology and meaning.

Mining for Public Health

This project aims to conduct novel research in text mining and machine learning to transform the way in which evidence-based public health (EBPH) reviews are conducted. The aims of the project are to develop new unsupervised text mining methods for deriving term similarities, to support screening while searching in EBPH reviews, and to develop new algorithms for ranking and visualising meaningful associations of multiple types in a dynamic and iterative manner. These newly developed methods will be evaluated in EBPH reviews, based on implementation of a pilot, to ascertain the level of transformation in EBPH reviewing.

MMPathIC

The aim of this project is to create an environment which enables new biomarker tests, based on molecular pathology techniques, to be developed. These can then be used to stratify patients, to allow more accurate diagnosis or prediction of the best treatments to use. The initial focus will be on people who suffer from inflammatory diseases (psoriasis, rheumatoid arthritis and lupus), given the availability of a large number of patient samples for these diseases. Text mining will be employed to carry out automated semantic analysis of various "unstructured" textual information sources that may contain information that is relevant to the development of biomarker tests, including biomedical literature and electronic health records. Given that each of these sources constitutes vast numbers of documents, information contained within them may be hidden and easily overlooked. TM techniques will be used in a number of ways to enhance the ease and efficiency with which unstructured textual information sources can be exploited to support the development of biomarker tests.

OpenMinTeD

The Open Mining Infrastructure for Text and Data (OpenMinTeD) project seeks to develop an interoperable text mining infrastructure that will unite the efforts of several key players in the text mining world. Crucially, this project involves the communities at the heart of using text mining, with partners in the life sciences, the social sciences and scholarly communication. The project will develop an infrastructure which combines the power of several established text mining systems (including our platform, Argo). We will publish interoperability guidelines that will allow other systems to integrate with the OpenMinTeD platform. The broad aim of this project is to unite the efforts of text miners across Europe and the world, simultaneously promoting reusability and community uptake.

UIMA

One of the core challenges facing text mining and natural language processing (NLP) researchers and tool developers is the general lack of interoperability between different tools and resources. At NaCTeM we have looked to solve this by adapting our tools to function within the UIMA framework, enabling direct interaction with tools provided by other groups around the world.

Our UIMA work is widely recognised, and Prof. Sophia Ananiadou, the director of NaCTeM, received IBM UIMA Innovation Awards in three successive years: 2006, 2007 and 2008.

Past Projects

ADVISES

The ADVISES project will create a new way for scientists to communicate with computers. At present they have to use difficult tools which require them to speak the computers' language rather than express what they want in English. Worse still, the computer tools don't talk to each other, so scientists have to use separate tools for statistics, for visually displaying results on a map, etc. We will deal with these problems by analysing the way scientists express their requests in English to create a 'sub-language' - that is, a restricted set of English for asking scientific questions and saying how results should be displayed. A video explaining the implemented demo system is available.

ASSERT

The JISC-funded ASSERT (Automatic Summarisation for Systematic Reviews using Text Mining) project continues the work of the National Centre for Text Mining in the area of social sciences. The overall aim of ASSERT is to encourage greater participation by the social sciences community in e-Research by developing a summarisation service to facilitate the production of systematic reviews and to support a number of community projects related to text mining applications.

ASSIST

The ASSIST project investigates the benefits of text mining in two case studies within the social science disciplines. This includes a review of the requirements gathering stage in order to advise future projects in this area and the development of high profile exemplars demonstrating how text mining solutions can solve, in part at least, major challenges facing e-Researchers across all domains.

AstraZeneca Project (Automated Biological Event Extraction from the Literature for Drug Discovery)

This is a collaborative project between NaCTeM and AstraZeneca, which started on 1 September 2009 and runs for 3 years. The aim is to enhance our abilities to extract information from the growing corpus of literature, to make the process of synthesising the information more efficient and manageable, and as comprehensive and precise as possible. The hypothesis is that the outcome of the project will help enable the decision-making processes in a drug discovery project to take place using as much pertinent and up-to-date information as possible, and thus maximise the quality of pre-clinical decision making.

To achieve this aim, the objectives of this project and the research novelties are: a) to customise deep semantic text mining techniques to extract protein-bioprocess associations automatically; b) to extract biological events pertaining to protein-disease associations automatically from the literature; c) to support the semi-automatic production of annotated texts pertaining to biological information for text mining applications; d) to identify automatically bioprocesses linked with protein-disease events; e) to produce a text mining service supporting biologists researching protein-bioprocesses from the vast amount of literature.

Arabic WordNet

Arabic WordNet involves the construction of an Arabic WordNet, following the development process of Princeton WordNet and Euro WordNet. It utilizes the Suggested Upper Merged Ontology as an interlingua to link Arabic WordNet to previously developed wordnets.

BBC

The BBC News Browser Pilot Project aims to analyse, structure and visualise BBC news available on the Web according to a user's query using advanced text mining techniques. The outcomes include a web demonstrator of two concept clustering tools and presentations to identified sets of potential users within New Media & Technology, News and BBC Monitoring, and to BBC Research/Technology Group.

BOOTStrep

BOOTStrep (Bootstrapping Of Ontologies and Terminologies STrategic REsearch Project) is an international joint EU project (Ref. FP6 - 028099), which aims at building reusable wide-coverage lexical, conceptual and factual knowledge resources for the biology domain, involving the exploitation and combination of existing terminological resources (thesauri, classification systems, etc.) within a common, standardized representation framework.

CheTA

CheTA integrated Cambridge's chemical text mining tool OSCAR with the U-Compare workflow infrastructure developed by NaCTeM and others. This integration has added chemistry to the world's largest public collection of interoperable text mining tools and will be highly valued by influential stakeholders both in the JISC community and the wider chemistry community. After a baseline study (UCC and RSC) and the integration were accomplished, the project used the CheTA tools to index a corpus of documents of different types and provenance. CheTA developed a rigorous evaluation framework with annotation studies for a formal scientific evaluation of the system ('Are we extracting metadata correctly' - RSC/NaCTeM), user requirements studies for the metadata needs of 'real world users' ('What metadata is useful?' - RSC/UCC) and comparing extracted metadata against the usefulness (all project partners). Finally, the economic cost of metadata generation by both human indexers and robots was quantified.

Clinical Trials

The aim of the Clinical Trials project is to develop an efficient search application customised to clinical trials, which addresses the information overload problem and assists in the creation of new protocols. Text and data mining methods will be applied to large clinical trial collections in order to enrich clinical trial documents with metadata, which in turn serves as an effective tool to narrow down searches.

DECA

The DECA (Disease Extraction with Concept Association) project concerned automatically extracting associations between concepts in the biomedical domain, such as diseases and symptoms, from collections of biomedical texts (e.g., MEDLINE). The aim of this project was to combine the strengths of the NaCTeM text mining software tools, Kleio and FACTA, and to create an efficient search facility for associations between biomedical concepts. A considerable amount of research was also put into lexical disambiguation of biomedical names.

eScholar project

The University of Manchester's eScholar is a search facility that gives researchers access to scholarly work produced by individuals associated with the university. The project involves enriching the current faceted search capabilities of eScholar by customising, adapting and combining existing text mining tools and algorithms, such as keyword extraction, named entity recognition and topic clustering, to foster the discovery of interdisciplinary links. This project will impact on the advancement of new interdisciplinary research, which is reliant on identifying potential synergies between the work of different groups within the university.

FixRep

This joint project from UKOLN, NaCTeM and Knowledge Integration brought together the experience of each partner in text analysis and information extraction techniques in order to complete a practical evaluation of formal metadata generation methods within real world workflows. These included the well-known problem of metadata deposit, and workflows from later in the metadata lifecycle; triage - incremental improvement of metadata through error identification and correction - and normalisation, the increase of consistency for a specific purpose, such as republishing of the record as part of an overlay journal. The suitability of extracted formal metadata for purposes such as creation of metadata records, input into existing services for external subject classification or geographical localisation, and for reviewing resource accessibility and preservation were evaluated.

FLaReNet

The FLaReNet project (Fostering Language Resources Network) is an eContentPlus project funded by the European Commission. The purpose of the project is to develop a common vision of the area of Language Resources and Language Technologies for the coming years, and to foster a European strategy for consolidating the sector and enhancing competitiveness at EU level and worldwide. FLaReNet will analyse the sector along various dimensions: technical, scientific but also organisational, economic, political and legal. Once the more pressing issues have been selected, the mission of FLaReNet is to identify priorities as well as long-term strategic objectives and provide consensual recommendations in the form of a plan of action for EC, national organisations and industry.

Infectious Diseases

In September 2009, the National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), awarded a 5-year contract to support the biomedical research community's work on infectious diseases.

As part of the contract, NaCTeM is collaborating with Virginia Bioinformatics Institute (VBI) to integrate vital information on pathogens, provide key resources and tools to scientists, and help researchers to analyze genomic, proteomic and other data arising from infectious disease research.

Integrated Social History Environment for Research (ISHER) - Digging into Social Unrest

ISHER aims to enhance search over digitised resources for social history. Enhancement comes through text mining-based rich semantic metadata extraction for collection indexing, clustering and classification. This then allows semantic search while reducing the manual costs currently involved in such activities.

Interoperability of text mining tools is a key objective and an organizing principle for the software architecture of our project. IBM's Unstructured Information Management Architecture (UIMA) forms the basis of our interoperable text mining platform U-Compare, which has over 50 text mining components in its library and is extensible, so it can accommodate ISHER's requirements by also including text mining tools from third parties.

INTUTE

The INTUTE Project aimed to develop an intelligent semantic search service using NaCTeM's text mining tools, which would grant users the benefit of searching within an enhanced subset of the Intute repository, a collection of academic/technical reports under the domain-heading of Bio-medical Science or Social Science.

Japan Science and Technology Agency Project

The aim of the project was to investigate the acquisition of lexical and terminological information for a machine translation environment. The main aspects of the JST project were to investigate the use of machine learning techniques for the development of efficient clustering and classification algorithms to be used for text mining applications, and in particular machine translation.

KISTI Pathway Project

NaCTeM is collaborating with the Korea Institute of Science and Technology Information (KISTI) to develop the next generation of information extraction and text mining systems for supporting and automating various aspects of biomolecular pathway model curation.

Building on the PathText text mining integration technology for pathways, text mining systems such as MEDIE and event extraction tools such as EventMine, we are developing methods for identifying literature relevant to specific reactions in pathway models and for automatically analysing documents to extract event structures that capture the full semantics of pathway reactions.

Mining the History of Medicine

This project, a cross-disciplinary collaboration between the National Centre for Text Mining (NaCTeM) and the Centre for the History of Science, Technology and Medicine (CHSTM) at the University of Manchester, seeks to demonstrate the potential of text mining in medical history.

META-NET

META-NET aims to build the technological foundations of a multilingual European information society. Through the Multilingual Europe Technology Alliance (META), META-NET aims to bring together researchers, commercial technology providers, private and corporate language technology users, language professionals and other information society stakeholders. META will prepare the necessary ambitious joint effort towards furthering language technologies as a means towards realising the vision of a Europe united as one single digital market and information space.

ONDEX

The ONDEX project addresses the problem that a prerequisite to a systems approach to biological research is the integration and analysis of heterogeneous experimental data, which are stored in hundreds of life-science databases and millions of scientific publications. Its aims are to produce a robust, fully featured, extensible, easy to use and professionally-supported data integration framework for systems biology projects to use. A more detailed overview is available in this poster presentation video, given at ISMB 2009 and presented by Chris Rawlings.

OSSMETER

OSSMETER aims to extend the state-of-the-art in the field of automated analysis and measurement of Open Source Software, and develop a platform that will support decision makers in the process of discovering, comparing, assessing and monitoring the health, quality, impact and activity of open-source software.

To achieve this, OSSMETER will compute trustworthy quality indicators by performing advanced analysis and integration of information from diverse sources including the project metadata, source code repositories, communication channels and bug tracking systems of Open Source Software projects.

ParTeM

The ParTeM project (Massively Parallel Processing of Full Text Articles using DEISA) combined expertise in text mining and high performance computing to enable massively parallel text mining applications to scale beyond thousands of processors, since there is an urgent need to find workable solutions to the problem of data deluge for large-scale text mining applications. The motivation is to process large text datasets from multiple scientific domains within reasonable time. Processing full text articles instead of abstracts will allow researchers/scientists across the world to find more relationships within text that were not known before. This will only be possible with a system that exploits the storage capabilities and the parallel nature of high performance computing platforms by porting a number of advanced text mining techniques to the DEISA platform.

PathText/Refine

Many systems have been developed in the past few years to assist researchers in the discovery of knowledge published as English text, for example in the PubMed database. At the same time, higher level collective knowledge is often published using a graphical notation representing all the entities in a pathway and their interactions. We believe that these pathway visualizations could serve as an effective user interface for knowledge discovery if they can be linked to the text in publications. Since the graphical elements in a pathway are of a very different nature to their corresponding descriptions in English text, we have developed PathText to serve as a bridge between these two systems.