Text Mining Resources
Corpora
English Corpora
- The British National Corpus is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a broad cross-section of current British English.
- Assorted frequency lists for the British National Corpus.
- European Language Resources Association is a not-for-profit organisation that makes various language engineering resources available. It also evaluates language engineering technologies.
- The Linguistic Data Consortium (LDC) creates, collects and distributes speech and text databases, lexicons, and other resources for research and development.
- International Computer Archive of Modern and Medieval English (ICAME) is an international organization of linguists and information scientists working with English machine-readable texts. ICAME distributes the ICAME Corpus Collection on its web site and publishes the ICAME Journal, which appears at least once a year with articles and information about English computer corpora.
- The Oxford Text Archive (OTA) is the Arts and Humanities Data Service for Literature, Languages and Linguistics. Together with members of the academic community, they collect, catalogue, and preserve high-quality electronic texts for research and teaching. The OTA currently distributes more than 2000 resources in over 20 different languages, and is actively working to extend its catalogue of holdings.
- ICE-GB is the British component of the International Corpus of English (ICE). It is available on CD-ROM, and a 20,000-word sample corpus is available for download.
- The Collins WordbanksOnline English corpus is composed of 56 million words of contemporary written and spoken text.
- Special Interest Group on the Lexicon (SIGLEX) provides an umbrella for research interests on lexical issues ranging from lexicography and the use of online dictionaries to computational lexical semantics. SIGLEX is also the umbrella organisation for SENSEVAL, evaluation exercises for Word Sense Disambiguation.
- English Language Interview Corpus as a Second-Language Application (ELISA) is a resource for language learning and teaching, and interpreter training.
- Wortschatz allows a search in 17 Corpus-Based Monolingual Dictionaries, including English.
- The KEMPE Korpus of Early Modern Playtexts in English contains text corpora, CG-annotated corpora and treebanks and allows three different search techniques.
- The IViE corpus contains recordings of nine urban dialects of English spoken in the British Isles.
Bio-medical Corpora
- The GENIA Corpus is a collection of biomedical literature. It has been compiled and annotated within the scope of the GENIA project, whose goal is to develop text mining (TM) systems for the domain of molecular biology. The corpus has been developed to provide reference material for the development of bio-TM systems. It currently contains 1,999 Medline abstracts, which were collected using the three MeSH terms "human", "blood cells", and "transcription factors".
The corpus has been annotated with various levels of linguistic and semantic information, including:
- Part-of-speech
- Treebank
- Coreference
- Terms
- Events
- Cellular localization
- Disease-Gene Association
- Pathway corpus
- BioNLP'09 Shared Task data set. This consists of a portion of the GENIA event corpus in stand-off annotation format. It was created to facilitate the training and testing of IE systems for the shared task at BioNLP'09, which was concerned with the recognition of bio-events in biomedical literature.
- BioInfer is a corpus of 1,100 sentences from biological research articles, annotated with three levels of information:
- syntactic dependencies
- named entities
- relationships between entities
- PennBioIE is a corpus of 2257 abstracts from PubMed, annotated with the following:
- paragraph
- sentence
- tokenization
- part of speech
- syntactic annotation (642 abstracts)
- biomedical entities
- GENETAG is a corpus of 20,000 sentences from MEDLINE annotated with gene/protein names.
- AIMed is a corpus of 1955 sentences from MEDLINE abstracts annotated with gene/protein names and protein-protein interactions.
- TREC Genomics Track provides a forum for evaluation of information retrieval systems in the genomics domain. The data from each track is available to everyone (a data usage agreement must be signed to access the document collections).
- BioMedCentral's corpus comprises the 9325 peer-reviewed biomedical research articles published so far, all of which are covered by an open access license agreement that allows free distribution and re-use of the full-text article, including the highly structured XML version.
- PathBinder is a collection of sentences extracted from MEDLINE. Every sentence contains two or more different biomolecules. A dictionary of 40,000 biomolecules (80,000 names) was used to scan all MEDLINE abstracts. The sentences are organised in a two-level indexed structure.
- The Yapex Collections are two collections of MEDLINE abstracts obtained in different ways: 1) a document set retrieved by posing the query 'protein binding [Mesh term] AND interaction AND molecular' to MEDLINE with the parameters 'abstract', 'english', 'human', and 'publication date 1996-2001'; and 2) a test collection whose abstracts were randomly chosen from the GENIA corpus.
- Benchmarks and Corpora for BioNLP is a list of corpora used for natural-language processing and text mining in the biomedical domain (compiled by Joerg Hakenberg).
- Corpora for biomedical NLP is a list compiled by the Biomedical Text Mining Group at the Center for Computational Pharmacology. It now contains links to more than twenty different corpora and text collections, all focussed on biomedical domains, as well as links to publications on biomedical corpora.
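The dictionary-based scan described for PathBinder above can be illustrated with a small sketch. This is not PathBinder's actual implementation, which is not described here: the toy dictionary `NAME_TO_ID`, the naive sentence splitter, and the reading of the "two-level indexed structure" as biomolecule -> biomolecule -> sentences are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical toy dictionary: surface name -> canonical biomolecule ID.
# Several names may map to the same biomolecule (synonyms).
NAME_TO_ID = {
    "p53": "TP53",
    "tp53": "TP53",
    "mdm2": "MDM2",
    "nf-kappa b": "NFKB",
    "nf-kb": "NFKB",
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_biomolecules(sentence):
    """Return the set of canonical IDs whose names occur in the sentence."""
    lowered = sentence.lower()
    found = set()
    for name, mol_id in NAME_TO_ID.items():
        # Require non-alphanumeric boundaries so a name is not matched
        # inside a longer token.
        pattern = r"(?<![a-z0-9])" + re.escape(name) + r"(?![a-z0-9])"
        if re.search(pattern, lowered):
            found.add(mol_id)
    return found

def build_index(abstracts):
    """Assumed two-level index: biomolecule -> biomolecule -> sentences."""
    index = defaultdict(lambda: defaultdict(list))
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            ids = find_biomolecules(sentence)
            if len(ids) < 2:
                continue  # keep only sentences with two or more biomolecules
            for a, b in combinations(sorted(ids), 2):
                index[a][b].append(sentence)
                index[b][a].append(sentence)
    return index
```

Calling `build_index` on a list of abstract strings and then looking up `index["MDM2"]["TP53"]` would return the sentences in which both biomolecules were found. A real system would need a proper sentence splitter and a far more efficient matcher than one regular expression per dictionary name.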



