Text Mining Resources
Corpora
English Corpora
- British National Corpus is a
100-million-word collection of samples of written and spoken language from
a wide range of sources, designed to represent a broad cross-section
of current British English.
- European Language Resources Association is
a not-for-profit organisation that makes various language engineering resources
available. It also evaluates language engineering technologies.
- LDC (Linguistic Data Consortium)
creates, collects and distributes speech and text databases, lexicons, and
other resources for research and development.
- International Computer Archive
of Modern and Medieval English (ICAME) is an international organization
of linguists and information scientists working with English machine-readable
texts. ICAME distributes the ICAME Corpus Collection on its web site
and publishes the ICAME Journal, which appears at least once a year with
articles and information about English computer corpora.
- The Oxford Text Archive (OTA) is the
Arts and Humanities Data Service for Literature, Languages and Linguistics.
Together with members of the academic community, they collect, catalogue,
and preserve high-quality electronic texts for research and teaching. The
OTA currently distributes more than 2000 resources in over 20 different languages,
and is actively working to extend its catalogue of holdings.
- ICE-GB is the
British component of the International Corpus of English (ICE). It is available
on CD-ROM, and a 20,000-word sample corpus is available for download.
- The Collins WordbanksOnline English corpus is composed of 56 million words of contemporary written and spoken text.
- Special Interest Group on the
Lexicon (SIGLEX) provides an umbrella for research interests on lexical
issues ranging from lexicography and the use of online dictionaries to
computational lexical semantics. SIGLEX is also the umbrella organisation
for SENSEVAL, evaluation exercises for Word Sense Disambiguation.
- English Language Interview Corpus as a Second-Language Application (ELISA) is a resource for language learning and teaching, and interpreter training.
- Wortschatz allows a search in 17 Corpus-Based Monolingual Dictionaries, including English.
- The KEMPE corpus of Early Modern Playtexts in English contains text corpora, CG-annotated corpora and treebanks and allows three different search techniques.
- The IViE corpus contains recordings of nine urban dialects of English spoken in the British Isles.
Bio-medical Corpora
- BioInfer is a corpus of 1,100 sentences from
biological research articles, annotated with three levels of information:
- syntactic dependencies
- named entities
- relationships between entities
- PennBioIE is a
corpus of 2257 abstracts from PubMed, annotated with the following:
- paragraph
- sentence
- tokenization
- part of speech
- syntactic annotation (642 abstracts)
- biomedical entities
- AIMed is a corpus of 1955 sentences
from MEDLINE abstracts annotated with gene/protein names and protein-protein interactions.
- TREC Genomics Track provides
a forum for evaluation of information retrieval systems in the genomics domain.
The data from each track is available to everyone (a data usage agreement
must be signed to access the document collections).
- BioMed Central's
corpus comprises the 248830 articles of peer-reviewed biomedical
research published so far, all of which are covered by its open access license agreement,
which allows free distribution and re-use of the full text article, including
the highly structured XML version.
- PathBinder is
a collection of sentences extracted from MEDLINE. Every sentence contains
two or more different biomolecules. A dictionary of 40,000 biomolecules (80,000
names) was used to scan all MEDLINE abstracts. The sentences are
organised in a two-level indexed structure.
- Benchmarks
and Corpora for BioNLP is a list of corpora used for natural-language
processing and text mining in the biomedical domain (compiled by Joerg
Hakenberg).
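
The dictionary scan described for PathBinder above can be illustrated with a minimal Python sketch. It is not PathBinder's actual implementation: the toy dictionary, the function names, the naive sentence splitter and the single-token name matching are all assumptions made for illustration. It keeps only sentences that mention at least two different biomolecules, treating several names of the same biomolecule as one.

```python
import re

# Hypothetical toy dictionary: each biomolecule ID maps to its known names.
# (PathBinder's real dictionary has about 40,000 biomolecules and 80,000 names.)
BIOMOLECULES = {
    "TP53": ["p53", "TP53"],
    "MDM2": ["MDM2", "HDM2"],
    "INS": ["insulin"],
}


def build_name_index(dictionary):
    """Map each lower-cased name to its biomolecule ID."""
    return {
        name.lower(): mol_id
        for mol_id, names in dictionary.items()
        for name in names
    }


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_biomolecules(sentence, name_index):
    """Return the set of distinct biomolecule IDs named in the sentence.

    Assumption: names are single alphanumeric tokens; multi-word names
    would need a phrase matcher instead.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return {name_index[t] for t in tokens if t in name_index}


def extract_sentences(text, dictionary, min_molecules=2):
    """Keep sentences mentioning at least `min_molecules` different biomolecules."""
    name_index = build_name_index(dictionary)
    hits = []
    for sentence in split_sentences(text):
        found = find_biomolecules(sentence, name_index)
        if len(found) >= min_molecules:
            hits.append((sentence, sorted(found)))
    return hits


abstract = (
    "MDM2 binds p53 and targets it for degradation. "
    "TP53 is also called p53. "
    "Insulin levels were unchanged."
)
for sentence, molecules in extract_sentences(abstract, BIOMOLECULES):
    print(molecules, sentence)
```

Only the first sentence of the sample abstract is kept: the second mentions two names of the same biomolecule, and the third mentions only one biomolecule.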