Text Mining for Biodiversity


NaCTeM has produced a number of tools and resources, with the overarching aim of facilitating the collaborative study and discussion of legacy biodiversity documents by a worldwide community.

This page provides brief details of the following:

  • The individual projects in which the work has been carried out.
  • The outcomes of the projects, in terms of tools, resources, publications and talks.


Mining Biodiversity

The Mining Biodiversity project is an international collaboration between the National Centre for Text Mining (UK), Missouri Botanical Garden (US), Dalhousie University's Big Data Analytics Institute (Canada) and Ryerson University's Social Media Lab (Canada). Its overarching goal is to transform the Biodiversity Heritage Library (BHL), a digital library of over 40 million pages of taxonomic literature, into a next-generation social digital resource. In this project, methods for text mining, visualisation and social media analysis were developed to serve BHL users with semantically enriched content. The resulting digital resource provides access to the full content of BHL documents via semantically enhanced, interactive browsing and searching capabilities, allowing users to locate information of interest more efficiently.


COPIOUS

The COPIOUS project aims to produce a knowledge repository of Philippine biodiversity by combining the domain expertise and resources of Philippine partners with the text mining-based big data analytics of the University of Manchester's National Centre for Text Mining. The repository will bring together different types of information (e.g., taxonomic, occurrence, ecological, biomolecular and biochemical), giving users a comprehensive view of species of interest that will allow them to (1) carry out predictive analysis of species distributions, and (2) investigate potential medicinal applications of natural products derived from Philippine species.


Tools and resources

Terminological Inventory for Biodiversity

We have compiled a terminological inventory for biodiversity by combining all of the names available in the Catalogue of Life (CoL), the Encyclopedia of Life (EoL) and the Global Biodiversity Information Facility (GBIF). More information is available at

Biodiversity Search System

A visual text analytics search system for biodiversity that uses the terminological inventory as metadata for query expansion is available at
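As a rough illustration of how a terminological inventory can drive query expansion, the sketch below maps any known name variant to its full group of variants, so that a search for one name also retrieves documents mentioning the others. It is a minimal sketch under stated assumptions, not the search system's actual implementation: the `INVENTORY` contents and the function names `build_index` and `expand_query` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical miniature inventory: accepted name -> other known variants.
# The real inventory combines names from CoL, EoL and GBIF.
INVENTORY: Dict[str, Set[str]] = {
    "Panthera leo": {"lion", "Felis leo"},
    "Quercus robur": {"English oak", "pedunculate oak"},
}


def build_index(inventory: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map every lower-cased variant to the full set of variants in its group."""
    index: Dict[str, Set[str]] = {}
    for accepted, variants in inventory.items():
        group = {accepted, *variants}
        for name in group:
            index.setdefault(name.lower(), set()).update(group)
    return index


def expand_query(query: str, index: Dict[str, Set[str]]) -> List[str]:
    """Return all known variants of the query term, or the term itself if unknown."""
    term = query.strip()
    return sorted(index.get(term.lower(), {term}), key=str.lower)


index = build_index(INVENTORY)
expanded = expand_query("lion", index)
```

Here a query for a vernacular name is expanded to include the accepted scientific name and a historical synonym; a term absent from the inventory is passed through unchanged.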

Named Entity Recognition

Two models, for Taxon and Habitat detection, have been incorporated into the Argo component of NERSuite Custom Tagger, where users can select the model they would like to apply to their text. We also compiled two dictionaries that can be used to automatically ground a detected Taxon or Habitat, i.e., to assign it an identifier. The dictionary for grounding Taxon entities was created by collecting the names available in the Catalogue of Life (CoL). For Habitat entities, we constructed the dictionary by extracting all terms provided by the Environment Ontology (ENVO). Given an ID provided by CoL or ENVO, the entity can be linked back to the original resource. The two dictionaries are available in an Argo component named Concept Normaliser. A demonstration workflow, named COPIOUS-Taxon and Habitat, has been made publicly available at
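The grounding step described above can be pictured as a dictionary lookup keyed on the entity type. The following is a minimal sketch assuming exact matching after simple normalisation; it does not reproduce the Concept Normaliser component, and the dictionary entries, the placeholder identifiers and the names `Entity`, `normalise` and `ground` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entity:
    text: str   # surface form found in the document
    label: str  # entity type assigned by the tagger: "Taxon" or "Habitat"


# Hypothetical dictionary entries with placeholder identifiers;
# real identifiers would come from CoL and ENVO respectively.
TAXON_DICT: Dict[str, str] = {"panthera leo": "COL:placeholder-1"}
HABITAT_DICT: Dict[str, str] = {"mangrove swamp": "ENVO:placeholder-1"}

DICTIONARIES: Dict[str, Dict[str, str]] = {
    "Taxon": TAXON_DICT,
    "Habitat": HABITAT_DICT,
}


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def ground(entity: Entity) -> Optional[str]:
    """Return the identifier for the entity, or None if it is not in the dictionary."""
    dictionary = DICTIONARIES.get(entity.label, {})
    return dictionary.get(normalise(entity.text))


taxon_id = ground(Entity("Panthera  leo", "Taxon"))
habitat_id = ground(Entity("coral reef", "Habitat"))
```

Keeping one dictionary per entity type means a string is only ever matched against the resource appropriate to its label; entities with no dictionary entry are left ungrounded rather than guessed.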


Publications

Li, M., Nguyen, N. and Ananiadou, S. (In Press). Proactive Learning for Named Entity Recognition. In: Proceedings of the BioNLP Workshop 2017.

Nguyen NTH, Soto AJ, Kontonatsios G, Batista-Navarro R, Ananiadou S (2017). Constructing a biodiversity terminological inventory. PLoS ONE 12(4): e0175277.

Batista-Navarro R., Zerva C., Nguyen N.T.H., Ananiadou S. (2017). A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository. In: Lossio-Ventura J., Alatrista-Salas H. (eds) Information Management and Big Data. SIMBig 2015, SIMBig 2016. Communications in Computer and Information Science, vol 656. Springer.

Batista-Navarro, R., Hammock, J., Ulate, W. and Ananiadou, S. (2016). A Text Mining Framework for Accelerating the Semantic Curation of Literature. In: Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), pp. 459-462, Springer.

Batista-Navarro, R., Soto, A., Ulate, W. and Ananiadou, S. (2016). Text Mining Workflows for Indexing Archives with Automatically Extracted Semantic Metadata. In: Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016), pp. 471-473, Springer.


Talks

A talk entitled Enriching the legacy literature with OCR corrections and text-mined semantic metadata at the Annual Conference of Biodiversity Information Standards (TDWG) 2014 in Jönköping, Sweden.

A talk entitled Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction at the Annual Conference of Biodiversity Information Standards (TDWG) 2015 in Nairobi, Kenya.

A tutorial entitled Text mining workflows for indexing archives with automatically extracted semantic metadata at the 20th International Conference on Theory and Practice of Digital Libraries in Hannover, Germany.

A talk entitled Expanding Access to Biodiversity Literature at the 48th Annual Meeting of the Council on Botanical and Horticultural Libraries in Cleveland, Ohio.

A talk entitled Enhancing Semantic Search through the Automatic Construction of a Biodiversity Terminological Inventory at the Annual Conference of Biodiversity Information Standards (TDWG) 2016 in Costa Rica.

A talk entitled Text mining tools and infrastructure for biomedical applications: cancer biology, history of medicine, monitoring biodiversity at the Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece, 2015.

A talk entitled Re-usable text mining workflows for advanced search at the First Workshop on Text Mining in Natural Sciences (TMINS-1): Exploring Text Mining in Marine, Climate and Environmental Science, 12-13th November 2015, Norwegian University of Science and Technology (NTNU), Trondheim, Norway.

