Related Text Mining Research Projects
In addition to the development of text mining services and software tools, members of the National Centre for Text Mining are involved in a variety of other projects which influence and contribute to the work of the centre:
Current Projects
Argo
The objective of the Argo project is to develop a workbench for analysing (primarily annotating) textual data. The workbench, which is accessed as a web application, supports the combination of elementary text-processing components to form comprehensive processing workflows. It provides functionality to manually intervene in the otherwise automatic process of annotation by correcting or creating new annotations, and facilitates user collaboration by providing sharing capabilities for user-owned resources. Argo benefits users such as text-analysis designers by providing an integrated environment for the development of processing workflows; annotators/curators by providing manual annotation functionalities supported by automatic pre-processing and post-processing; and developers by providing a workbench for testing and evaluating text analytics.
AstraZeneca Project (Automated Biological Event Extraction from the Literature for Drug Discovery)
This is a collaborative project between NaCTeM and AstraZeneca, started on 1 September 2009 for 3 years. The aim is to enhance our abilities to extract information from the growing corpus of literature, to make the process of synthesising the information more efficient and manageable, and as comprehensive and precise as possible. The hypothesis is that the outcome of the project will help enable the decision-making processes in a drug discovery project to take place using as much pertinent and up-to-date information as possible, and thus maximise the quality of pre-clinical decision making.
To achieve this aim, the objectives of this project and the research novelties are: a) to customise deep semantic text mining techniques to extract protein-bioprocess associations automatically; b) to extract biological events pertaining to protein-disease associations automatically from the literature; c) to support the semi-automatic production of annotated texts pertaining to biological information for text mining applications; d) to identify automatically bioprocesses linked with protein-disease events; e) to produce a text mining service supporting biologists researching protein-bioprocess associations across the vast amount of literature.
Clinical Trials
The aim of the Clinical Trials project is to develop an efficient search application customised to clinical trials, addressing the information overload problem and assisting in the creation of new protocols. Text and data mining methods will be applied to large clinical trial collections in order to enrich clinical trial documents with metadata, which in turn serves as an effective tool to narrow down searches.
Europe PMC
This is a collaboration with the Text-Mining group at the European Bioinformatics Institute (EBI) and MIMAS, forming a work package in the Europe PMC project (formerly UKPMC) hosted and coordinated by the British Library. Europe PMC, as a whole, forms a European version of the PubMed Central paper repository, in collaboration with the National Institutes of Health (NIH) in the United States. Europe PMC is funded by a consortium of key biomedical research funders. Our contribution to this major project is in the application of text mining solutions to enhance information retrieval and knowledge discovery. As such, this is a large-scale application of technology developed in other NaCTeM projects, in a prominent resource for the biomedicine community.
FLaReNet
The FLaReNet project (Fostering Language Resources Network) is an eContentPlus project funded by the European Commission. The purpose of the project is to develop a common vision of the area of Language Resources and Language Technologies for the coming years, and to foster a European strategy for consolidating the sector and enhancing competitiveness at EU level and worldwide. FLaReNet will analyse the sector along various dimensions: technical, scientific but also organisational, economic, political and legal. Once the more pressing issues have been selected, the mission of FLaReNet is to identify priorities as well as long-term strategic objectives and provide consensual recommendations in the form of a plan of action for EC, national organisations and industry.
Infectious Diseases
In September 2009, the National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), awarded a 5-year contract to support the biomedical research community's work on infectious diseases.
As part of the contract, NaCTeM is collaborating with Virginia Bioinformatics Institute (VBI) to integrate vital information on pathogens, provide key resources and tools to scientists, and help researchers to analyze genomic, proteomic and other data arising from infectious disease research.
Integrated Social History Environment for Research (ISHER) - Digging into Social Unrest
ISHER aims to enhance search over digitised resources for social history. Enhancement comes through text mining-based rich semantic metadata extraction for collection indexing, clustering and classification. This then allows semantic search while reducing the manual costs currently involved in such activities.
Interoperability of text mining tools is a key objective and an organizing principle for the software architecture of our project. IBM's Unstructured Information Management Architecture (UIMA) forms the basis of our interoperable text mining platform U-Compare, which has over 50 text mining components in its library and is extensible, so it can accommodate ISHER’s requirements by also including text mining tools from third parties.
KISTI Pathway Project
NaCTeM is collaborating with the Korea Institute of Science and Technology Information (KISTI) to develop the next generation of information extraction and text mining systems for supporting and automating various aspects of biomolecular pathway model curation.
Building on the PathText text mining integration technology for pathways, text mining systems such as MEDIE and event extraction tools such as EventMine, we are developing methods for identifying literature relevant to specific reactions in pathway models and for automatically analysing documents to extract event structures that capture the full semantics of pathway reactions.
META-NET
META-NET aims to build the technological foundations of a multilingual European information society. Through the Multilingual Europe Technology Alliance (META), META-NET aims to bring together researchers, commercial technology providers, private and corporate language technology users, language professionals and other information society stakeholders. META will prepare the necessary ambitious joint effort towards furthering language technologies as a means towards realising the vision of a Europe united as one single digital market and information space.
UIMA
One of the core challenges facing text mining and natural language processing (NLP) researchers and tool developers is the general lack of interoperability between different tools and resources. At NaCTeM we have looked to solve this by adapting our tools to function within the UIMA framework enabling direct interaction with those tools provided by other groups around the world.
Our UIMA work is widely recognised, and Prof. Sophia Ananiadou, the director of NaCTeM, received IBM UIMA Innovation Awards in three successive years: 2006, 2007 and 2008.
Past Projects
ADVISES
The ADVISES project will create a new way for scientists to communicate with computers. At present they have to use difficult tools which require them to speak the computers' language rather than express what they want in English. Worse still, the computer tools don't talk to each other, so scientists have to use separate tools for statistics, for visually displaying results on a map, etc. We will deal with these problems by analysing the way scientists express their requests in English to create a 'sub-language' - that is, a restricted set of English for asking scientific questions and saying how results should be displayed. A video explaining the implemented demo system is available.
ASSERT
The JISC-funded ASSERT (Automatic Summarisation for Systematic Reviews using Text Mining) project extends the work of the National Centre for Text Mining into the area of social sciences. The overall aim of ASSERT is to encourage greater participation by the social sciences community in e-Research by developing a summarisation service to facilitate the production of systematic reviews and to support a number of community projects related to text mining applications.
ASSIST
The ASSIST project investigates the benefits of text mining in two case studies within the social science disciplines. This includes a review of the requirements gathering stage in order to advise future projects in this area and the development of high profile exemplars demonstrating how text mining solutions can solve, in part at least, major challenges facing e-Researchers across all domains.
Arabic WordNet
Arabic WordNet involves the construction of an Arabic WordNet, following the development process of Princeton WordNet and Euro WordNet. It utilizes the Suggested Upper Merged Ontology as an interlingua to link Arabic WordNet to previously developed wordnets.
BBC
The BBC News Browser Pilot Project aims to analyse, structure and visualise BBC news available on the Web according to a user's query using advanced text mining techniques. The outcomes include a web demonstrator of two concept clustering tools and presentations to identified sets of potential users within New Media & Technology, News and BBC Monitoring, and to BBC Research/Technology Group.
BOOTStrep
BOOTStrep (Bootstrapping Of Ontologies and Terminologies STrategic REsearch Project) is an international joint EU project (Ref. FP6 - 028099), which aims at building reusable wide-coverage lexical, conceptual and factual knowledge resources for the biology domain, involving the exploitation and combination of existing terminological resources (thesauri, classification systems, etc.) within a common, standardized representation framework.
CheTA
CheTA integrated Cambridge's chemical text mining tool OSCAR with the U-Compare workflow infrastructure developed by NaCTeM and others. This integration has added chemistry to the world's largest public collection of interoperable text mining tools and will be highly valued by influential stakeholders both in the JISC community and the wider chemistry community. After a baseline study (UCC and RSC) and the integration were accomplished, the project used the CheTA tools to index a corpus of documents of different types and provenance. CheTA developed a rigorous evaluation framework with annotation studies for a formal scientific evaluation of the system ('Are we extracting metadata correctly' - RSC/NaCTeM), user requirements studies for the metadata needs of 'real world users' ('What metadata is useful?' - RSC/UCC) and comparing extracted metadata against the usefulness (all project partners). Finally, the economic cost of metadata generation by both human indexers and robots was quantified.
DECA
The DECA (Disease Extraction with Concept Association) project concerned automatically extracting associations between concepts in the biomedical domain, such as diseases and symptoms, from collections of biomedical texts (e.g., MEDLINE). The aim of this project was to combine the strengths of the NaCTeM text mining software tools, Kleio and FACTA, and to create an efficient search facility for associations between biomedical concepts. Also, a considerable amount of research was put into lexical disambiguation of the biomedical names.
FixRep
This joint project from UKOLN, NaCTeM and Knowledge Integration brought together the experience of each partner in text analysis and information extraction techniques in order to complete a practical evaluation of formal metadata generation methods within real world workflows. These included the well-known problem of metadata deposit, and workflows from later in the metadata lifecycle: triage (incremental improvement of metadata through error identification and correction) and normalisation (the increase of consistency for a specific purpose, such as republishing of the record as part of an overlay journal). The suitability of extracted formal metadata for purposes such as creation of metadata records, input into existing services for external subject classification or geographical localisation, and for reviewing resource accessibility and preservation was evaluated.
INTUTE
The INTUTE Project aimed to develop an intelligent semantic search service using NaCTeM's text mining tools, which will grant users the benefit of searching within an enhanced subset of the Intute repository, a collection of academic/technical reports under the domain-heading of Bio-medical Science or Social Science.
Japan Science and Technology Agency Project
The aim of the project was to investigate the acquisition of lexical and terminological information for a machine translation environment. The main aspects of the JST project were to investigate the use of machine learning techniques for the development of efficient clustering and classification algorithms to be used for text mining applications, and in particular machine translation.
ONDEX
The ONDEX project addresses the problem that a prerequisite to a systems approach to biological research is the integration and analysis of heterogeneous experimental data, which are stored in hundreds of life-science databases and millions of scientific publications. Its aims are to produce a robust, fully featured, extensible, easy to use and professionally-supported data integration framework for systems biology projects to use. A more detailed overview is available in this poster presentation video, given at ISMB 2009 and presented by Chris Rawlings.
ParTeM
The ParTeM project (Massively Parallel Processing of Full Text Articles using DEISA) combined expertise in text mining and high performance computing to enable massively parallel text mining applications to scale beyond thousands of processors, since there is an urgent need for solutions to the problem of data deluge in large-scale text mining applications. The motivation is to process large text datasets from multiple scientific domains within reasonable time. Processing full text articles instead of abstracts will allow researchers and scientists across the world to find more relationships within text that were not previously known. This will only be possible with a system that exploits the storage capabilities and the parallel nature of high performance computing platforms by porting a number of advanced text mining techniques to the DEISA platform.
PathText/Refine
Many systems have been developed in the past few years to assist researchers in the discovery of knowledge published as English text, for example in the PubMed database. At the same time, higher level collective knowledge is often published using a graphical notation representing all the entities in a pathway and their interactions. We believe that these pathway visualizations could serve as an effective user interface for knowledge discovery if they can be linked to the text in publications. Since the graphical elements in a Pathway are of a very different nature to their corresponding descriptions in English text, we have developed PathText to serve as a bridge between these two systems.