Biological terminology is a frequent cause of analysis errors when processing biology literature. For example, "retroregulate" is a terminological verb often used in molecular biology, but it is not included in conventional dictionaries.
The BioLexicon is a linguistic resource tailored for the biology domain to cope with these problems. It contains the following types of entries:
- a set of terminological verbs
- a set of derived forms of the terminological verbs
- general English words frequently used in the biology domain
- domain terms
This comprehensive coverage of biological terms makes the lexicon a unique linguistic resource within the domain.
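As a rough illustration of how the four entry types listed above might sit together in one lookup structure, here is a minimal Python sketch. It does not reflect the actual BioLexicon schema or distribution format: the names `EntryType`, `LexEntry` and `build_index`, the fields, and the hyphenated variant "retro-regulate" are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class EntryType(Enum):
    # The four entry types named in the list above (labels are illustrative).
    TERMINOLOGICAL_VERB = "terminological verb"
    DERIVED_FORM = "derived form"
    GENERAL_WORD = "general English word"
    DOMAIN_TERM = "domain term"


@dataclass
class LexEntry:
    # Hypothetical record layout, not the real BioLexicon schema.
    lemma: str
    pos: str
    entry_type: EntryType
    variants: List[str] = field(default_factory=list)


def build_index(entries: List[LexEntry]) -> Dict[str, List[LexEntry]]:
    """Map every lower-cased surface form (lemma or variant) to its entries."""
    index: Dict[str, List[LexEntry]] = {}
    for entry in entries:
        for form in [entry.lemma, *entry.variants]:
            index.setdefault(form.lower(), []).append(entry)
    return index


# Toy entries built only from words mentioned in the text.
entries = [
    LexEntry("retroregulate", "verb", EntryType.TERMINOLOGICAL_VERB,
             variants=["retro-regulate"]),
    LexEntry("retroregulation", "noun", EntryType.DERIVED_FORM),
    LexEntry("express", "verb", EntryType.GENERAL_WORD),
]
index = build_index(entries)
```

Indexing by every surface form means a tagger or parser could resolve a spelling variant to the same entry as its lemma with a single dictionary lookup.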
Over the past twenty years, there have been remarkable advances in natural language processing (NLP) and text mining (TM) technologies. Various practical NLP/TM tools, such as part-of-speech taggers, chunkers, syntactic parsers and named entity recognizers, are now widely available. However, text in biology has different characteristics from general-language documents such as newspaper articles. The biology domain has a strong demand for NLP/TM results, yet it is also one of the most challenging domains for text processing. NLP/TM tasks in this domain are difficult because the following types of terminological information lack coverage:
- Large-scale domain-specific terminologies
- Domain-specific word usage
- Domain-specific relations between words
Technical terms are a major barrier to bio-text processing. A huge number of biological, chemical and medical terms appear in the literature, and new terms are coined every day. Furthermore, these terms have many spelling and semantic variants, so the same biomedical entity can appear in different written forms. For example, the BioThesaurus contains more than 15 million gene/protein names, yet it still does not cover the wide variety of gene/protein name variants that actually appear in the literature.
Word usage can also be idiosyncratic to the bio-domain. For example, "express" often indicates a specific biological process, gene expression, and takes specific types of named entities, such as gene and protein names, as its arguments.
In addition, many words are related in a biology-specific manner. For example, the verb "retroregulate" has "retroregulation" as its nominal form and "retroregulatory" as its adjectival form. Derivational relations of this kind cannot be fully covered by general English dictionaries and thesauri such as WordNet. To the best of our knowledge, there is no biology-specific lexicon that addresses the above linguistic issues.
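The derivational relations described above can be pictured as a small table linking a base verb to its derived forms. The following Python sketch is only an illustration under stated assumptions: the triple layout, the relation labels "nominalization" and "adjectival", and the helper names `build_derivation_maps` and `base_verb` are invented for this example and are not taken from the BioLexicon itself.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (base verb, derived form, relation label). The words come from the text;
# the relation labels are hypothetical.
DERIVATIONS: List[Tuple[str, str, str]] = [
    ("retroregulate", "retroregulation", "nominalization"),
    ("retroregulate", "retroregulatory", "adjectival"),
]


def build_derivation_maps(triples):
    """Build verb -> {relation: derived form} and derived form -> (verb, relation)."""
    forward: Dict[str, Dict[str, str]] = defaultdict(dict)
    backward: Dict[str, Tuple[str, str]] = {}
    for base, derived, relation in triples:
        forward[base][relation] = derived
        backward[derived] = (base, relation)
    return dict(forward), backward


def base_verb(word: str, backward: Dict[str, Tuple[str, str]]) -> str:
    """Return the recorded base verb of a derived form, or the word itself."""
    base, _relation = backward.get(word, (word, ""))
    return base


forward, backward = build_derivation_maps(DERIVATIONS)
```

With such a table, a mention of "retroregulation" in text could be linked back to the verb "retroregulate", which is the kind of link a general English dictionary would not supply.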
- Thompson, Paul, John McNaught, Simonetta Montemagni, Nicoletta Calzolari, Riccardo del Gratta, Vivian Lee, Simone Marchi, Monica Monachini, Piotr Pezik, Valeria Quochi, C.J. Rupp, Yutaka Sasaki, Giulia Venturi, Dietrich Rebholz-Schuhmann and Sophia Ananiadou. The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics 2011, 12:397.
- Sasaki, Yutaka, John McNaught and Sophia Ananiadou. The value of an in-domain lexicon in genomics QA. Journal of Bioinformatics and Computational Biology, 8(1), 147--161, 2010.
- Sasaki, Yutaka, Paul Thompson, John McNaught and Sophia Ananiadou. Three BioNLP Tools Powered by the BioLexicon. In Proc. of EACL 2009 Demonstration Session, pp. 61--64, 2009.
- Venturi, Giulia, Simonetta Montemagni, Simone Marchi, Yutaka Sasaki, Paul Thompson, John McNaught and Sophia Ananiadou. Bootstrapping a Verb Lexicon for Biomedical Information Extraction. In Gelbukh, A. (Ed.) Proceedings of the 10th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2009), pp. 137--148, Springer, 2009.
- Sasaki, Yutaka, Simonetta Montemagni, Piotr Pezik, Dietrich Rebholz-Schuhmann, John McNaught and Sophia Ananiadou. BioLexicon: A Lexical Resource for the Biology Domain. In Proc. of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), 2008.
- Rebholz-Schuhmann, Dietrich, Piotr Pezik, Vivian Lee, Jung-Jae Kim, Riccardo del Gratta, Yutaka Sasaki, Jock McNaught, Simonetta Montemagni, Monica Monachini, Nicoletta Calzolari and Sophia Ananiadou. Towards a Reference Terminological Resource in the Biomedical Domain. In Proc. of 16th Ann. Int. Conf. on Intelligent Systems for Molecular Biology (ISMB-2008), Toronto, Canada, 2008.