NaCTeM

Text Mining Resources

Corpora

English Corpora

  • British National Corpus is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English.
  • European Language Resources Association is a not-for-profit organisation that makes various language engineering resources available. It also evaluates language engineering technologies.
  • The Linguistic Data Consortium (LDC) creates, collects and distributes speech and text databases, lexicons, and other resources for research and development.
  • International Computer Archive of Modern and Medieval English (ICAME) is an international organization of linguists and information scientists working with English machine-readable texts. ICAME distributes the ICAME Corpus Collection on its web site and publishes the ICAME Journal, which appears at least once a year with articles and information about English computer corpora.
  • The Oxford Text Archive (OTA) is the Arts and Humanities Data Service for Literature, Languages and Linguistics. Together with members of the academic community, they collect, catalogue, and preserve high-quality electronic texts for research and teaching. The OTA currently distributes more than 2000 resources in over 20 different languages, and is actively working to extend its catalogue of holdings.
  • ICE-GB is the British component of the International Corpus of English (ICE). It is available on CD-ROM, and a 20,000-word sample corpus is available for download.
  • The Collins WordbanksOnline English corpus is composed of 56 million words of contemporary written and spoken text.
  • Special Interest Group on the Lexicon (SIGLEX) provides an umbrella for research interests on lexical issues ranging from lexicography and the use of online dictionaries to computational lexical semantics. SIGLEX is also the umbrella organisation for SENSEVAL, evaluation exercises for Word Sense Disambiguation.
  • English Language Interview Corpus as a Second-Language Application (ELISA) is a resource for language learning and teaching, and interpreter training.
  • Wortschatz allows searching 17 corpus-based monolingual dictionaries, including English.
  • The KEMPE corpus of Early Modern Playtexts in English contains text corpora, CG-annotated corpora and treebanks and allows three different search techniques.
  • The IViE corpus contains recordings of nine urban dialects of English spoken in the British Isles.

Bio-medical Corpora

  • BioInfer is a corpus of 1,100 sentences from biological research articles, annotated with three levels of information:
    • syntactic dependencies
    • named entities
    • relationships between entities
  • PennBioIE is a corpus of 2257 abstracts from PubMed, annotated with the following:
    • paragraph
    • sentence
    • tokenization
    • part of speech
    • syntactic annotation (642 abstracts)
    • biomedical entities
  • AIMed is a corpus of 1955 sentences from MEDLINE abstracts annotated with gene/protein names and protein-protein interactions.
  • TREC Genomics Track provides a forum for evaluation of information retrieval systems in the genomics domain. The data from each track is available to everyone (a data usage agreement must be signed to access the document collections).
  • BioMedCentral's corpus comprises the 248830 peer-reviewed biomedical research articles published so far, all of which are covered by an open access license agreement that allows free distribution and re-use of the full-text article, including the highly structured XML version.
  • PathBinder is a collection of sentences extracted from MEDLINE. Every sentence contains two or more different biomolecules. A dictionary of 40,000 biomolecules (80,000 names) was used to scan all MEDLINE abstracts. The sentences are organised in a 2-level indexed structure.
  • Benchmarks and Corpora for BioNLP is a list of corpora used for natural-language processing and text mining in the biomedical domain (compiled by Joerg Hakenberg).
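
The dictionary scan described for PathBinder above can be illustrated with a minimal Python sketch. This is not PathBinder's actual implementation: the toy dictionary, the naive sentence splitter, the single-token name matching, and the choice of "first biomolecule, then second biomolecule" as the two index levels are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical toy dictionary: lowercase surface name -> biomolecule identifier.
# Several names may map to the same biomolecule (synonyms).
NAME_TO_MOLECULE = {
    "p53": "TP53",
    "tp53": "TP53",
    "mdm2": "MDM2",
    "insulin": "INS",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def molecules_in(sentence, name_to_molecule=NAME_TO_MOLECULE):
    """Return the set of distinct biomolecule identifiers named in a sentence.

    Only single-token names are matched here (a simplifying assumption).
    """
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return {name_to_molecule[t] for t in tokens if t in name_to_molecule}


def build_index(abstracts, name_to_molecule=NAME_TO_MOLECULE):
    """Keep sentences naming two or more different biomolecules.

    Assumed two-level layout: index[molecule_a][molecule_b] -> list of sentences.
    """
    index = defaultdict(lambda: defaultdict(list))
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            found = molecules_in(sentence, name_to_molecule)
            if len(found) < 2:
                continue  # synonyms of one biomolecule do not count as two
            for a, b in combinations(sorted(found), 2):
                index[a][b].append(sentence)
                index[b][a].append(sentence)
    return index


# Made-up example text, not taken from MEDLINE.
SAMPLE_ABSTRACTS = [
    "MDM2 binds p53 and promotes its degradation. "
    "Insulin lowers blood glucose. "
    "The p53 protein, also called TP53, is a tumour suppressor."
]

if __name__ == "__main__":
    for sentence in build_index(SAMPLE_ABSTRACTS)["TP53"]["MDM2"]:
        print(sentence)
```

In this sketch only the first sample sentence is indexed: the second names a single biomolecule, and the third names the same biomolecule twice under two synonyms, so neither satisfies the "two or more different biomolecules" condition.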