Automated Biological Event Extraction from the Literature for Drug Discovery
Overview
This is a three-year collaborative project between NaCTeM and AstraZeneca, started on 1 September 2009. The aim is to enhance our ability to extract information from the growing corpus of literature, and to make the process of synthesising that information more efficient and manageable, and as comprehensive and precise as possible. The hypothesis is that the outcome of the project will enable decision-making in a drug discovery project to draw on as much pertinent and up-to-date information as possible, and thus maximise the quality of pre-clinical decision making.
To achieve this aim, the objectives of this project and the research novelties are: a) to customise deep semantic text mining techniques to extract protein-bioprocess associations automatically; b) to extract biological events pertaining to protein-disease associations automatically from the literature; c) to support the semi-automatic production of annotated texts pertaining to biological information for text mining applications; d) to identify automatically bioprocesses linked with protein-disease events; e) to produce a text mining service supporting biologists researching protein-bioprocess associations in the vast amount of literature.
NaCTeM will carry out research on automatic event and biological process recognition from texts, with help from AstraZeneca's domain expertise.
Demonstration Systems
FACTA+ for Cancer Research
Angiogenesis is the physiological process involving the growth of new blood vessels from pre-existing vessels. Identifying genes and other molecules that regulate this bioprocess has become an important line of research in cancer treatment. A new version of FACTA+ is available for finding angiogenesis-associated genes and other biological entities from text.
Angiogenesis Event Extraction
We built a text mining pipeline that extracts and highlights terms and events describing the angiogenesis bioprocess, as well as biological entities such as genes, gene products, tissues and cells. A web-based demonstration system is available.
News
Success in the BioCreAtIvE III Challenges
NaCTeM took part in the BioCreAtIvE (Critical Assessment of Information Extraction in Biology) challenge for 2010. The team participated in the protein-protein interaction (PPI) challenge and achieved the best performance in the Interaction Method Task (IMT). This task involves automatically detecting the experimental techniques used in research articles to support given PPIs. Such detection is crucial not only for the correct annotation of experimentally determined protein interactions but also for other annotations, such as evidence codes in the Gene Ontology, and for assigning other controlled vocabulary terms to an article. Among the systems submitted by 8 international teams, NaCTeM's yielded the best overall performance as measured by a range of evaluation metrics. The NaCTeM BioCreAtIvE team consisted of S. Ananiadou, R.T. Batista-Navarro, R. Nawaz, C. Nobata, R. Rak, A. Restificar, C.J. Rupp and X. Wang.
Background
Over the last decade, despite a doubling in industrial and public funding for biomedical research, approval of new medical entities by regulatory agencies has halved. Only 11% of molecules that enter pre-clinical development reach the market, and the average R&D cost of a new medicine is estimated at $454m. Among the most commonly cited reasons for this high clinical attrition rate, especially in later phase III clinical trials, are idiosyncratic drug-induced toxicity and the lack of drug efficacy over and above placebos, especially if the compound has a novel mechanism of action.
To reduce this high drug attrition rate in late-phase clinical trials, we crucially need to improve 'Confidence in Rationale' of the candidate drug target(s). Such confidence comes from clear scientific evidence of how, when modulated, a target affects critical pathophysiological processes, leading to cure or prevention of the disease or amelioration of symptoms in the clinical setting. Typically a bank of pre-clinical evidence is developed using cell lines, model organisms and clinical samples associating a target with key bioprocesses (and so disease phenotype). However, the primary starting point for target choice, and the context for interpretation of all pre-clinical observations, is the literature. Manual techniques and conventional information retrieval techniques are unable to deliver timely, reliable, exhaustive and specific results given the vastness of the literature and its speed of growth. There is thus an immediate and urgent need for advanced automated means to support drug target identification by flagging protein involvement in key bioprocesses and tracking the accumulation of evidence over time (hypothesis generation).
Text mining (TM) is increasingly used to support knowledge discovery and hypothesis generation, and to manage the mass of biological literature. However, the TM systems commonly used to help researchers discover direct associations between biomedical terms typically rely on co-occurrence approaches, which look at how frequently entities co-occur in the same articles or sentences. Such approaches often fail to recognise the underlying mechanisms, that is, how biological entities are involved in biological processes, because surface words are often ambiguous (i.e., have different meanings depending on the context). In such cases, deep semantic analysis of the text is required.
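To make the limitation of co-occurrence concrete, the following is a minimal Python sketch, not the method used by FACTA+ or any NaCTeM system. The function name `sentence_cooccurrence`, the naive sentence splitter, the substring matching and the two example sentences are all illustrative assumptions. It counts how often pairs of terms appear in the same sentence.

```python
from collections import Counter
from itertools import combinations
import re


def sentence_cooccurrence(text, terms):
    """Count how often each pair of terms appears in the same sentence.

    Hypothetical helper for illustration: sentences are split naively on
    end punctuation, and terms are matched as case-insensitive substrings.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        # Terms found in this sentence, in a stable case-insensitive order.
        present = sorted((t for t in terms if t.lower() in lowered), key=str.lower)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Invented example sentences, used only to illustrate the point.
text = (
    "VEGF promotes angiogenesis in tumours. "
    "Endostatin inhibits angiogenesis."
)
counts = sentence_cooccurrence(text, ["VEGF", "endostatin", "angiogenesis"])
for pair, n in counts.items():
    print(pair, n)
```

Both pairs receive the same count, although one sentence describes promotion and the other inhibition. The counts record that two terms are associated but not the type or direction of the event, which is the kind of information event extraction aims to recover.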
Project Team
- NaCTeM Team
  - Principal Investigator: Prof. Sophia Ananiadou
  - Co-investigator: Prof. Jun-ichi Tsujii
  - Lead Researcher: Dr. Sampo Pyysalo
  - Associated Researcher: Dr. Makoto Miwa
- AstraZeneca Team
  - Advisory Group: Dr. Ian Dix, Dr. Tim French, Dr. Mark Pearson and Dr. Darren Cross
  - Delivery Team: Mr. Iain McKendrick and Dr. Ian Barrett
Publications
Mu, T., Goulermas, J. Y., Tsujii, J. and Ananiadou, S. (2012). Proximity-based Frameworks for Generating Embeddings from Multi-Output Data. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Kano, Y., Björne, J., Ginter, F., Salakoski, T., Buyko, E., Hahn, U., Cohen, K. B., Verspoor, K., Roeder, C., Hunter, L., Kilicoglu, H., Bergler, S., Van Landeghem, S., Van Parys, T., Van de Peer, Y., Miwa, M., Ananiadou, S., Neves, M., Pascual-Montano, A., Ozgur, A., Radev, D. R., Riedel, S., Sætre, R., Chun, H.-W., Kim, J.-D., Pyysalo, S., Ohta, T. and Tsujii, J. (2011). U-Compare bio-event meta-service: compatible BioNLP event extraction services. BMC Bioinformatics, 12, 481
Ohta, T., Pyysalo, S., Ananiadou, S. and Tsujii, J. (2011). Pathway Curation Support as an Information Extraction Task. In Proceedings of the Fourth International Symposium on Languages in Biology and Medicine (LBM 2011).
Pyysalo, S., Ohta, T. and Ananiadou, S. (2011). Anatomical Entity Recognition with Open Biomedical Ontologies. In Proceedings of the Fourth International Symposium on Languages in Biology and Medicine (LBM 2011).
X. Wang, I. McKendrick, I. Barrett, I. Dix, T. French, J. Tsujii and S. Ananiadou (2011). Automatic Extraction of Angiogenesis Bio-Process from Text. Bioinformatics, 27(19), 2730-2737.
Y. Tsuruoka, M. Miwa, K. Hamamoto, J. Tsujii and S. Ananiadou (2011). Discovering and visualising indirect associations between biomedical concepts. Bioinformatics, 27(13), i111-i119.
X. Wang, R. Rak, A. Restificar, C. Nobata, C. Rupp, R. Batista-Navarro, R. Nawaz and S. Ananiadou (2011). Detecting Experimental Techniques and Selecting Relevant Documents for Protein-Protein Interactions from Biomedical Literature. BMC Bioinformatics, 12(Suppl 8):S11. (Best performing system in BioCreative III's Interaction Method Task.)
T. Mu and S. Ananiadou. (2010). Proximity-based graph embeddings for multi-label classification. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2010), Valencia, Spain.
X. Wang, R. Rak, A. Restificar, C. Nobata, C.J. Rupp, R.T.B. Batista-Navarro, R. Nawaz and S. Ananiadou. (2010). NaCTeM Systems for BioCreative III PPI Tasks. In Proceedings of the BioCreative III Workshop. Bethesda, MD, USA.
T. Mu, X. Wang, J. Tsujii and S. Ananiadou (2010). Imbalanced classification using dictionary-based Prototypes and Hierarchical Decision Rules for Entity Sense Disambiguation. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China, pp. 851-859.
