NaCTeM

Text Mining Resources

This page provides links to various types of resources, developed both by NaCTeM and externally, which may be used in the development and/or evaluation of text mining systems.

Corpora

This section provides links to corpora of various sizes, with different levels of annotation, and belonging to different domains. Several of these corpora have been manually annotated by experts with semantic information, as a means to train and/or evaluate machine-learned text mining tools.

  • ACE Meta-knowledge - An enrichment of the English ACE 2005 corpus, relating to news, in which information pertaining to various aspects of event interpretation has been added.
  • Anatomy Corpora - A collection of corpora manually annotated with fine-grained, species-independent anatomical entities, to facilitate the development of text mining systems that can carry out detailed and comprehensive analyses of biomedical scientific text.
  • BioCause - A collection of 19 full-text biomedical articles, in which previously added entity and event annotations have been enriched with causality annotations.
  • GENIA - A collection of 2000 biomedical abstracts, with various levels of syntactic and semantic annotation.
  • GENIA Meta-Knowledge - An enrichment of the GENIA event corpus, with various aspects of information pertaining to the interpretation of events.
  • GREC - A collection of 240 MEDLINE abstracts, annotated with events pertaining to gene regulation.
  • HIMERA - A corpus of published historical medical documents manually annotated with semantic information relevant to the study of medical history and public health.
  • Metabolite and Enzyme Corpus - A corpus of MEDLINE abstracts annotated by experts with metabolite and enzyme names.
  • PhenoCHF - A corpus consisting of biomedical articles and clinical records, annotated with phenotypic information related to congestive heart failure (CHF). Various levels of annotation are included, i.e., entity mentions, their normalisation to concept IDs in the UMLS Metathesaurus, and relations involving entity mentions.

Terminologies

Biomedical Text Mining for Chinese

A number of tools and resources have been developed at NaCTeM to aid with the processing of biomedical text in the Chinese language.

BioLexicon

NaCTeM was involved in the development of this large-scale terminological resource to support text mining in the biomedical domain.

Anatomy Resources

A collection of tools and lexical resources, developed at NaCTeM, which use the anatomy domain ontologies available in the OBO Foundry collection of Open Biological and Biomedical Ontologies to facilitate the detection and classification of anatomical entity mentions.

Evaluation

Text mining systems are usually evaluated by comparing their performance on a common task using a common data set. Links to several well-known community evaluations, some of which provide annotated resources, are provided.
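As a rough illustration of what comparing systems on a common data set can look like, the sketch below scores the output of two systems against the same gold-standard annotations. It assumes exact-match scoring with precision, recall and F1, which are widely used but are not prescribed by this page; individual community evaluations define their own matching criteria and metrics. The `Span` tuple layout, the `score` function and the sample annotations are all hypothetical and serve only as an example.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical annotation format: (document id, start offset, end offset, label).
Span = Tuple[str, int, int, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score(gold: Set[Span], predicted: Set[Span]) -> Scores:
    """Exact-match precision, recall and F1 of predictions against gold annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return Scores(precision, recall, 0.0)
    f1 = 2 * precision * recall / (precision + recall)
    return Scores(precision, recall, f1)


# Invented gold annotations shared by both systems (the "common data set").
gold = {
    ("doc1", 0, 4, "Gene"),
    ("doc1", 10, 15, "Protein"),
    ("doc2", 3, 8, "Gene"),
}

# Invented outputs of two systems run on the same documents.
system_a = {("doc1", 0, 4, "Gene"), ("doc1", 10, 15, "Gene")}
system_b = {("doc1", 0, 4, "Gene"), ("doc1", 10, 15, "Protein"), ("doc2", 3, 9, "Gene")}

scores_a = score(gold, system_a)
scores_b = score(gold, system_b)

for name, s in (("A", scores_a), ("B", scores_b)):
    print(f"System {name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.f1:.2f}")
```

Because both systems are scored against the same gold set with the same matching rule, their figures are directly comparable. Note that exact matching is strict: system B's third span is off by one character and therefore counts as an error, which is why some evaluations also report relaxed or overlap-based matching.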