A Terminological Inventory for Biodiversity
Description
To construct the inventory, we first compiled a species name dictionary by combining all of the names available in the Catalogue of Life (CoL), the Encyclopedia of Life (EoL) and the Global Biodiversity Information Facility (GBIF). The terms contained in this dictionary were then located within the text of English Biodiversity Heritage Library (BHL) documents (about 24 million pages of text) using a string matching method. We then learned vector representations of the identified terms using three different approaches, namely count-based, prediction-based and compositional distributional semantic models (DSMs). These approaches compute vector representations for both single and multi-word terms. The cosine similarity between two vectors serves as an indicator of the semantic relatedness between the corresponding terms: the higher the cosine similarity, the greater the relatedness of the two terms. Finally, for a given term, we selected the top-k candidates ranked by cosine similarity as its most semantically related terms.
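The ranking step described above can be illustrated with a minimal sketch. It is not the code used to build the inventory: the function names, the toy three-dimensional vectors and the example species are assumptions made for illustration only, and real DSM vectors would have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def top_k_related(term, vectors, k=20):
    """Return the k terms whose vectors are most similar to that of `term`."""
    query = vectors[term]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors, not taken from the inventory.
toy_vectors = {
    "Quercus robur": [0.9, 0.1, 0.0],
    "Quercus petraea": [0.8, 0.2, 0.1],
    "Panthera leo": [0.0, 0.1, 0.9],
}

ranked = top_k_related("Quercus robur", toy_vectors, k=2)
```

In this toy example the two oak species rank closest to each other because their vectors point in nearly the same direction, while the lion vector is almost orthogonal to both.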
The inventory contains 288,562 names of species whose frequency in BHL documents is at least five. For each term in the inventory, the 20 most semantically similar terms are provided, together with their corresponding similarity scores. To facilitate further digital biodiversity processes, each term is also linked to its URI, UUID and LSID indexed by Global Names.
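As a rough illustration of the frequency threshold, the sketch below counts dictionary names in a few pages of text and keeps those seen at least five times. The exact string matching method used for the inventory is not specified here, so the case-sensitive whole-word regular expression, the function names and the sample pages are all assumptions; a naive loop like this would also be far too slow for 24 million pages.

```python
import re
from collections import Counter


def count_name_mentions(names, pages):
    """Count whole-word, case-sensitive occurrences of each name across pages."""
    counts = Counter()
    for name in names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        for page in pages:
            counts[name] += len(pattern.findall(page))
    return counts


def filter_by_frequency(counts, min_frequency=5):
    """Keep only names mentioned at least `min_frequency` times."""
    return {name: n for name, n in counts.items() if n >= min_frequency}


# Hypothetical sample data, for illustration only.
sample_names = ["Quercus robur", "Panthera leo"]
sample_pages = [
    "Quercus robur grows here. Quercus robur is common.",
    "Quercus robur grows here. Quercus robur is common.",
    "Quercus robur grows here. Quercus robur is common.",
    "Panthera leo was observed once.",
]

mention_counts = count_name_mentions(sample_names, sample_pages)
kept = filter_by_frequency(mention_counts, min_frequency=5)
```

Here only the name that reaches the threshold survives the filter; the name mentioned once is dropped.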
A search interface that uses the inventory as metadata for query expansion is available at http://nactem.ac.uk/BHLQueryExpansion/.
Availability
The inventory is available to download. Please observe the terms and conditions of the licence (see below).
Licence
The Terminological Inventory for Biodiversity was created at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK.
It is licensed under a Creative Commons Attribution 4.0 International License.
Please attribute NaCTeM when using the inventory and cite the following paper:
Nguyen, N. T. H., Soto, A., Kontonatsios, G., Batista-Navarro, R. and Ananiadou, S. (2017). Constructing a Biodiversity Terminological Inventory. PLOS ONE, 12(4), e0175277.