National Centre for Text Mining — Text Mining Resources

This page provides links to various types of resources, developed both by NaCTeM and externally, which may be used in the development and/or evaluation of text mining systems.

Corpora

This section provides links to corpora of various sizes, with different levels of annotation, and belonging to different domains. Several of these corpora have been manually annotated by experts with semantic information, as a means to train and/or evaluate machine-learned text mining tools.

ACE Meta-knowledge - An enrichment of the English ACE 2005 corpus, relating to news, in which information pertaining to various aspects of event interpretation has been added.
Anatomy Corpora - A collection of corpora manually annotated with fine-grained, species-independent anatomical entities, to facilitate the development of text mining systems that can carry out detailed and comprehensive analyses of biomedical scientific text.
BioCause - A collection of 19 full text biomedical articles, in which previously added entity and event annotations have been enriched with causality annotations.
ChEBI - A collection of 199 biomedical abstracts and 100 full text biomedical articles, annotated with named entities and relationships between them.
CHR - A distantly supervised dataset of 12,094 PubMed abstracts and their titles, dealing with binary interactions between chemicals.
Controllable readability corpus - A corpus consisting of 28,124 peer-reviewed biomedical research papers, along with their technical and plain language summaries, aimed at facilitating the development of summarisation systems that can generate both technical and lay summaries of biomedical articles.
COPD - A collection of 30 full-text articles, manually annotated with named entities, using a fine-grained annotation scheme that aims to capture detailed information about COPD phenotypes.
GENIA - A collection of 2000 biomedical abstracts, with various levels of syntactic and semantic annotations.
GENIA Meta-Knowledge - An enrichment of the GENIA event corpus, with various aspects of information pertaining to the interpretation of events.
GREC - A collection of 240 MEDLINE abstracts, annotated with events pertaining to gene regulation.
HIMERA - A corpus of published historical medical documents manually annotated with semantic information relevant to the study of medical history and public health.
Metabolite and Enzyme Corpus - A corpus of MEDLINE abstracts annotated by experts with metabolite and enzyme names.
MC-Fake - A fake news dataset containing 28,334 news events on multiple topics (Politics, Entertainment, Health, Covid-19, Syria War) and their corresponding social context.
Occupational Exposure - A corpus consisting of selected sections (i.e., Abstract, Methods and Results) of scientific research articles concerning occupational exposures to two different types of substance, i.e., diesel exhaust (51 articles) and respirable crystalline silica (RCS) (50 articles). The article sections have been annotated by experts in the field with 6 categories of named entities (NEs) relevant to the assessment of occupational substance exposures, particularly in the context of Job Exposure Matrices (JEMs).
PHAEDRA - A semantically annotated corpus for pharmacovigilance. The corpus includes five different levels of information, which allow detailed information about drug effects to be encoded.
PhenoCHF - A corpus consisting of biomedical articles and clinical records, annotated with phenotypic information related to congestive heart failure (CHF). Various levels of annotation are included, i.e., entity mentions, their normalisation to concept IDs in the UMLS Metathesaurus, and relations involving entity mentions.

Terminologies

Time-sensitive medical inventory - A collection of terms relevant to the study of medical history, each linked to other semantically-related terms.

Biomedical Text Mining for Chinese

A number of tools and resources have been developed at NaCTeM to aid with the processing of biomedical text in the Chinese language.

BioLexicon

NaCTeM was involved in the development of this large-scale terminological resource to support text mining in the biomedical domain.

Anatomy Resources

A collection of tools and lexical resources, developed at NaCTeM, making use of anatomy domain ontologies available in the OBO Foundry collection of Open Biological and Biomedical Ontologies to facilitate anatomical entity mention detection and classification.

Evaluation

Evaluations of text mining systems are usually carried out by comparing their performance on a common task using a common data set. Links to several well-known community evaluations, some of which provide annotated resources, are provided.
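
As a rough illustration of what comparing performance on a common data set can involve, the sketch below scores a system's predicted entity annotations against gold-standard annotations using precision, recall and F1 with exact span matching. The annotation format (start offset, end offset, label) and the names Span, Scores and score are assumptions made for this example only; individual community evaluations define their own formats, matching criteria and official scoring tools.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical annotation format for this sketch: (start_offset, end_offset, label).
Span = Tuple[int, int, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score(gold: Set[Span], predicted: Set[Span]) -> Scores:
    """Compare predicted annotations to gold annotations using exact matching."""
    # A prediction counts as correct only if offsets and label all match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return Scores(precision, recall, f1)


if __name__ == "__main__":
    # Toy data, invented purely to exercise the function.
    gold = {(0, 4, "Gene"), (10, 15, "Protein"), (20, 25, "Gene"), (30, 35, "Gene")}
    predicted = {(0, 4, "Gene"), (10, 15, "Protein"), (40, 45, "Gene")}
    result = score(gold, predicted)
    print(f"P={result.precision:.3f} R={result.recall:.3f} F1={result.f1:.3f}")
```

Because every system is scored against the same gold annotations with the same matching rule, the resulting figures can be compared directly across systems. Relaxed matching (for example, overlapping rather than identical spans) would give different numbers, so the matching rule should be stated alongside any reported scores.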