Overview
This is a collaboration with the Text-Mining group at the European Bioinformatics Institute (EBI) and MIMAS, forming a work package in the Europe PMC project (formerly UKPMC), which is hosted and coordinated by the British Library. Europe PMC as a whole forms a European version of the PubMed Central paper repository, in collaboration with the National Institutes of Health (NIH) in the United States. Europe PMC is funded by a consortium of key funding bodies. Our contribution to this major project is the application of text mining solutions to enhance information retrieval and knowledge discovery. As such, it applies technology developed in other NaCTeM projects on a large scale, in a prominent resource for the biomedical community.
Challenges
This project scales up existing text mining applications by applying them to the full text of the papers in the Europe PMC collection (currently 2.2m and growing). It also provides this functionality to a clearly targeted and equally expanding user community. In addition to the increase in the amount of text in the corpus to be annotated, the structure of the documents as research papers must be taken into account. The semantic search capabilities must be made both accessible and intuitive to users, while maintaining both the efficiency and the quality of the results.
Objectives
The specific project objectives are:
- Deliver content from annotated documents (e.g., identified concepts, links to databases, relations amongst concepts) to the “related arts” segment in UKPMC.
- Customize and implement cutting-edge, high-performance named entity recognisers for selected semantic types, and disambiguation modules for named entity types, prioritised for end-users.
- Customize and implement cutting-edge, high-performance linguistic analyzers for extracting a variety of biomedical facts of interest to the users.
- Annotate PMC documents with biomedical named entities, concepts and facts (using 1,2,3) and provide improved document representations where the contained concepts are linked to relevant biomedical databases to facilitate easy navigation between databases and the literature.
- Develop UKPMC search functionality. Use the concept annotations to index documents for information retrieval and provide automatic comparisons of documents to find related information (using 1,2,3). Rank facts and documents based on user interest and queries.
- Make generated resources publicly available wherever possible.
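The indexing and document comparison objective can be pictured with a small sketch. This is only an illustration under stated assumptions, not the project's implementation: the document identifiers, the concept identifiers and the function names are hypothetical, and Jaccard overlap of concept sets is used here as one simple stand-in for "automatic comparisons of documents".

```python
from collections import defaultdict

# Hypothetical annotations: document id -> set of concept identifiers.
# In practice these would come from the named entity recognisers.
ANNOTATIONS = {
    "doc-1": {"concept:A", "concept:B", "concept:C"},
    "doc-2": {"concept:B", "concept:C"},
    "doc-3": {"concept:D"},
}


def build_index(annotations):
    """Build an inverted index: concept identifier -> documents mentioning it."""
    index = defaultdict(set)
    for doc_id, concepts in annotations.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return index


def jaccard(a, b):
    """Overlap of two concept sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related(doc_id, annotations):
    """Rank all other documents by concept overlap with the given document."""
    source = annotations[doc_id]
    scores = [
        (other, jaccard(source, concepts))
        for other, concepts in annotations.items()
        if other != doc_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


index = build_index(ANNOTATIONS)
ranking = related("doc-1", ANNOTATIONS)
print(sorted(index["concept:B"]))
print(ranking)
```

Looking up a concept in `index` returns the documents annotated with it, while `related` orders the remaining documents by how many concepts they share with the chosen one.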
EvidenceFinder, a search tool based on text mining technology, is now available to test on the Europe PMC Labs website. EvidenceFinder presents the user with a list of questions relating to their query terms. For example, given the search term "IL-2", EvidenceFinder will present questions such as "What inhibits IL-2 receptor?" and "What binds to IL-2 receptor?". Each question locates statements in the text that discuss the search topic in a specific way. This helps users find information that might otherwise be missed, and quickly establish which articles do and do not contain the information being sought.
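The question-based search described above can be sketched roughly as follows. This is a minimal illustration, not EvidenceFinder's actual method: it assumes facts have already been extracted as (agent, verb, target, article) tuples, and the placeholder facts, article identifiers and the `questions_for` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical extracted facts: (agent, verb phrase, target, article id).
# The agents and article ids are placeholders, not real findings.
FACTS = [
    ("agent-1", "inhibits", "IL-2 receptor", "doc-1"),
    ("agent-2", "binds to", "IL-2 receptor", "doc-2"),
    ("agent-3", "binds to", "IL-2 receptor", "doc-3"),
    ("agent-4", "activates", "T cells", "doc-4"),
]


def questions_for(term, facts):
    """Group facts whose target mentions the term under a generated question."""
    term_lower = term.lower()
    answers = defaultdict(list)
    for agent, verb, target, article in facts:
        if term_lower in target.lower():
            # Template assumption: "What <verb> <target>?"
            answers[f"What {verb} {target}?"].append((agent, article))
    return dict(answers)


result = questions_for("IL-2", FACTS)
for question, hits in result.items():
    print(question, hits)
```

Each generated question is backed by at least one extracted statement, so selecting a question leads directly to the articles that contain an answer.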
Project Team
Principal Investigator: Prof. Sophia Ananiadou
Co-investigators: Mr John McNaught
Project Team (NaCTeM): Mr. Jacob Carter.
Past team members (NaCTeM): Mr. William Black, Dr. Makoto Miwa, Dr. Rafal Rak, Dr. Andrew Rowley.
Information about the latest Europe PMC developments can be found via the Europe PMC blog and two Europe PMC Twitter accounts:
- Europe PMC blog
- Europe PMC Articles Twitter account - An RSS feed of new free articles added to Europe PMC that are affiliated with Europe PMC funders.
- Europe PMC News Twitter account - The latest news about Europe PMC
Batista-Navarro, R. T. B., Rak, R. and Ananiadou, S. (2015). Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics. Journal of Cheminformatics, 7(Suppl 1), S6
Pyysalo, S. and Ananiadou, S. (2014). Anatomical Entity Mention Recognition at Literature Scale. Bioinformatics, 30(6), 868-875
Rak, R., Batista-Navarro, R. T. B., Carter, J., Rowley, A. and Ananiadou, S. (2014). Processing Biological Literature with Customisable Web Services Supporting Interoperable Formats. Database: The Journal of Biological Databases and Curation
Rak, R., Batista-Navarro, R. T. B., Rowley, A., Carter, J. and Ananiadou, S. (2014). Text Mining-assisted Biocuration Workflows in Argo. Database: The Journal of Biological Databases and Curation
Rak, R., Rowley, A., Carter, J., Batista-Navarro, R. T. B. and Ananiadou, S. (2014). Interoperability and Customisation of Annotation Schemata in Argo. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, pp. 3837-3842, European Language Resources Association (ELRA)
Batista-Navarro, R. T. B., Rak, R. and Ananiadou, S. (2013). Chemistry-specific Features and Heuristics for Developing a CRF-based Chemical Named Entity Recogniser. In Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, Bethesda, Maryland, USA, pp. 55-59
Black, W.J., Rupp, C. J., Nobata, C., McNaught, J., Tsujii, J. and Ananiadou, S. (2010). High-Precision Semantic Search by Generating and Testing Questions. In Proceedings of the UK e-Science All Hands Meeting 2010.
McEntyre, J. R., Ananiadou, S., Andrews, S., Black, W.J., Boulderstone, R., Buttery, P., Chaplin, D., Chevuru, S., Cobley, N., Coleman, L., Davey, P., Gupta, B., Haji-Gholam, L., Hawkins, C., Horne, A., Hubbard, S. J., Kim, J. -H., Lewin, I., Lyte, V., MacIntyre, R., Mansoor, S., Mason, L., McNaught, J., Newbold, E., Nobata, C., Ong, E., Pillai, S., Rebholz-Schuhmann, D., Rosie, H., Rowbotham, R., Rupp, C. J., Stoehr, P. and Vaughan, P. (2010). UKPMC: a full text article resource for the life sciences. Nucleic Acids Research, 39 (Suppl. 1), D58-D65.
Rupp, C. J., Thompson, P., Black, W.J., McNaught, J. and Ananiadou, S. (2010). A Specialised Verb Lexicon as the Basis of Fact Extraction in the Biomedical Domain. In Proceedings of Interdisciplinary Workshop on Verbs: The Identification and Representation of Verb Features (Verb 2010).