Chemistry Using Text Annotations
This project (CheTA) will integrate Cambridge's chemical text mining tool OSCAR with the U-Compare workflow infrastructure developed by NaCTeM and others. The integration adds chemistry to the world's largest public collection of interoperable text mining tools and will be highly valued by influential stakeholders in both the JISC community and the wider chemistry community. Once a baseline study (UCC and RSC) and the integration are complete, the project will use the CheTA tools to index a corpus of documents of different types and provenance. CheTA will develop a rigorous evaluation framework comprising annotation studies for a formal scientific evaluation of the system ('Are we extracting metadata correctly?' - RSC/NaCTeM), user requirements studies of the metadata needs of 'real world users' ('What metadata is useful?' - RSC/UCC), and a comparison of the extracted metadata against its usefulness to those users (all project partners). Finally, the economic cost of metadata generation by both human indexers and robots will be quantified.
It is expected that applying professionally maintained, automated and sustainable text mining services, enabled by CheTA, to public information sources such as PubMed will lead to significant future enhancements in resource discovery.
As part of CheTA, OSCAR has been refactored into different workflows. A workflow is a sequence of individual components that together perform a certain task, in this case named entity recognition of chemical elements.
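The idea of a workflow as a sequence of components can be illustrated with a minimal Python sketch. This is not the actual U-Compare or OSCAR API: the `Document` class, the component names (`tokenise`, `tag_chemicals`) and the two-word lexicon are all hypothetical, and exist only to show how each component consumes the output of the previous one.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Document:
    """Hypothetical container passed from one component to the next."""
    text: str
    tokens: List[Tuple[int, int]] = field(default_factory=list)         # (start, end)
    entities: List[Tuple[int, int, str]] = field(default_factory=list)  # (start, end, label)


Component = Callable[[Document], Document]


def tokenise(doc: Document) -> Document:
    # Toy tokeniser: every run of word characters becomes one token.
    doc.tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", doc.text)]
    return doc


# Illustrative lexicon only; a real recogniser is far more sophisticated.
TOY_LEXICON = {"ethanol", "benzene"}


def tag_chemicals(doc: Document) -> Document:
    # Toy recogniser: label tokens found in the lexicon as compounds ("CM").
    for start, end in doc.tokens:
        if doc.text[start:end].lower() in TOY_LEXICON:
            doc.entities.append((start, end, "CM"))
    return doc


def run_workflow(components: List[Component], doc: Document) -> Document:
    # A workflow is the components applied in order to the same document.
    for component in components:
        doc = component(doc)
    return doc


result = run_workflow([tokenise, tag_chemicals],
                      Document("Benzene was dissolved in ethanol."))
found = [result.text[s:e] for s, e, _ in result.entities]
print(found)
```

Because every component shares the same signature, one component (for example, the recogniser) can be swapped for another without touching the rest of the sequence, which is the property that makes workflows convenient for comparing alternative tools.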
A talk about OSCAR and U-Compare, presented at the OSCAR4 launch event, is available to watch online.
System requirements to run the workflows:
- Download the workflows listed below and save them on your machine
- U-Compare: load the U-Compare interface
- Read about loading and running workflows in U-Compare
- Load the workflows in U-Compare to annotate your files
- Your input files *must* be in text or XML format
Workflows
If you are using Safari, you may need to right-click (Ctrl+click) to download the workflows.
Corpus
Our workflows were evaluated against the SciBorg corpus, a document collection consisting of 42 full-text journal papers from the Royal Society of Chemistry (RSC). It contains chemical named entity annotations categorised into the following types: chemical compound (CM), chemical reaction (RN), chemical adjective (CJ), enzyme (ASE) and chemical prefix (CPR). The complete version of the SciBorg corpus is available for download. NOTE: Please observe the terms of the SciBorg corpus licence when downloading the corpus.
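The five annotation types can be written down as a small lookup table. The Python sketch below is illustrative only: the type codes and names come from the list above, but the `(start, end, code)` tuple layout and the `count_by_type` helper are assumptions made for the example and do not reflect the corpus's actual file format.

```python
from collections import Counter

# Type codes and names as listed for the SciBorg corpus.
ENTITY_TYPES = {
    "CM": "chemical compound",
    "RN": "chemical reaction",
    "CJ": "chemical adjective",
    "ASE": "enzyme",
    "CPR": "chemical prefix",
}


def count_by_type(annotations):
    """Count hypothetical (start, end, code) annotations per entity type."""
    counts = Counter()
    for _start, _end, code in annotations:
        if code not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type code: {code}")
        counts[ENTITY_TYPES[code]] += 1
    return counts


# Made-up annotations, purely to exercise the helper.
example = [(0, 7, "CM"), (25, 32, "CM"), (40, 49, "RN")]
counts = count_by_type(example)
print(counts)
```

Rejecting unknown codes early makes it easier to spot annotations that use a different tag set before they distort per-type counts.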
Argo
Argo is a web-based system for collaborative development of text mining workflows. Among the built-in components available is OscarMER, which runs OSCAR 3 with a maximum-entropy-based recogniser.
Publications
- Kontonatsios, G., Korkontzelos, I., Kolluru, B., Thompson, P. and Ananiadou, S. (In Press). Deploying and Sharing U-Compare Workflows as Web Services. Journal of Biomedical Semantics.
- Kolluru, B., Hawizy, L., Murray-Rust, P., Tsujii, J. and Ananiadou, S. (2011) Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry. PLoS ONE 6(5): e20181.
- Kontonatsios, G., Korkontzelos, I., Kolluru, B. and Ananiadou, S. (2011). Adding Text Mining Workflows as Web Services to the BioCatalogue. In Proceedings of the 4th International Workshop on Semantic Web Applications and Tools for the Life Sciences (SWAT4LS).
- Soldatova, L. N., Kolluru, B., King, R. D., Qi, D. and Ananiadou, S. (2011). An ontology-based disambiguation of terms. In Proceedings of the Workshop on Mining the Pharmacogenomics Literature, Pacific Symposium on Biocomputing.
- Corbett, P., Batchelor, C. and Teufel, S. (2007). Annotation of chemical named entities. In: Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, Prague, Czech Republic, 57-64.