The GREC Corpus
- 01/08/2011: A file in the Human part of the corpus (ID 8205615) was found to contain one character offset problem for a named entity. This problem has now been resolved, and the appropriate standoff annotation file (8205615.a1) has been replaced. The U-Compare corpus reader has also been updated accordingly.
- 2/12/2010: A corpus reader component for the GREC corpus is now available for use in the U-Compare text mining/natural language processing system. The component should be saved and imported into U-Compare, following the steps outlined here.
- 12/11/2010: Three files in the E. coli part of the corpus (IDs 9852003, 14996803 and 15995204) were found to contain minor errors. The abstract 14996803 contained an event that was not centred on a verb or nominalised verb, so the event was deleted. The abstracts 9852003 and 15995204 each contained one event in which a type that belonged to one of the event arguments had been assigned to the event itself. This has now been corrected in both files.
- 10/12/2009: Two files in the Human part of the corpus (IDs 8205615 and 9778250) were found to have character offset problems due to foreign characters. Offsets in these files were previously based on bytes rather than characters. The corpus has now been updated so that all offsets are based on characters.
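The difference between byte offsets and character offsets only shows up once a text contains non-ASCII characters. The following is a minimal Python sketch of the effect, using a made-up sentence (not taken from the corpus) and assuming UTF-8 encoding:

```python
# Hypothetical sentence (not from the corpus) with one non-ASCII character.
text = "The β-galactosidase gene is induced by lactose"
entity = "gene"

# Character offset: position counted in Unicode code points.
char_start = text.index(entity)

# Byte offset: position counted in bytes of the UTF-8 encoding (an assumption).
byte_start = text.encode("utf-8").index(entity.encode("utf-8"))

# Slicing with the character offset recovers the entity.
char_span = text[char_start:char_start + len(entity)]

# Using the byte offset as if it were a character offset shifts the span,
# because "β" occupies two bytes in UTF-8 but is a single character.
byte_span_misused = text[byte_start:byte_start + len(entity)]

print(char_start, byte_start, repr(char_span), repr(byte_span_misused))
```

Every non-ASCII character before an entity pushes its byte offset further from its character offset, which is why only abstracts containing such characters were affected.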
The corpus is available for download in two formats:
- A standoff format, based on the BioNLP'09 Shared Task format
- An XML format, based on the GENIA event annotation format
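As a rough illustration of how a standoff annotation can be read, the sketch below parses one text-bound annotation line. It assumes the layout used in the BioNLP'09 Shared Task (an ID, then a type with start and end character offsets, then the covered text, separated by tabs); the GREC files may differ in detail. The sample line, the type label `Gene` and the names `TextBound` and `parse_text_bound` are made up for the example.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    """One text-bound annotation: an ID, a type and a character span."""
    ann_id: str
    ann_type: str
    start: int
    end: int
    text: str


def parse_text_bound(line: str) -> TextBound:
    # Assumed layout: ID <TAB> TYPE START END <TAB> covered text
    ann_id, type_and_span, text = line.rstrip("\n").split("\t")
    ann_type, start, end = type_and_span.split(" ")
    return TextBound(ann_id, ann_type, int(start), int(end), text)


# Hypothetical abstract text and annotation line, for illustration only.
abstract = "The narL gene product activates the nitrate reductase operon"
sample = "T1\tGene 4 8\tnarL"

ann = parse_text_bound(sample)

# With character-based offsets, slicing the abstract returns the covered text.
matches = abstract[ann.start:ann.end] == ann.text
print(ann, matches)
```

Checking each span against the source text in this way is a cheap test for offset problems of the kind described in the update notes.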
Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. Effective IE systems require training data or annotated corpora, in which instances of biomedical events are explicitly identified in texts. The trained IE systems can then recognise instances of new events in texts, facilitating a number of text mining applications, such as pathway maintenance and semantic searching.
The GREC corpus is a semantically annotated corpus of 240 MEDLINE abstracts (167 on the subject of E. coli species and 73 on the subject of the Human species) which is intended for training IE systems and/or resources which are used to extract events from biomedical literature.
The corpus has been manually annotated by biologists with events relating to gene regulation. Each event is centred on either a verb (e.g. transcribe) or a nominalised verb (e.g. transcription), and annotation consists of identifying, as exhaustively as possible, the structurally-related arguments of the verb or nominalised verb within the same sentence. Each event argument is then assigned the following information:
- A semantic role from a fixed set of 13 roles which are tailored to the biomedical domain.
- A biomedical concept type (where appropriate).
As a simple example, consider the following sentence:
The narL gene product activates the nitrate reductase operon
The sentence contains a single event, centred on the verb activates, with 2 arguments, i.e.:
- AGENT: The narL gene product
- THEME: the nitrate reductase operon
Other types of argument include:
- LOCATION, e.g. In Escherichia coli, glnAP2 may be activated by NifA
- MANNER, e.g. cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR
- CONDITION, e.g. Strains carrying a mutation in the crp structural gene fail to repress ODC and ADC activities in response to increased cAMP
Full details of the annotation scheme can be found in the annotation guidelines.
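The event structure described above can be pictured as a small data structure: a trigger (the verb or nominalised verb) plus a list of arguments, each carrying a semantic role and, where appropriate, a concept type. The sketch below is only an illustration and not the corpus's own representation. The class names, the role labels `AGENT` and `THEME` and the concept type `Operon` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Argument:
    role: str                            # semantic role (label assumed here)
    text: str                            # argument text within the sentence
    concept_type: Optional[str] = None   # biomedical concept type, if any


@dataclass
class Event:
    trigger: str                         # verb or nominalised verb
    arguments: List[Argument] = field(default_factory=list)


sentence = "The narL gene product activates the nitrate reductase operon"

event = Event(
    trigger="activates",
    arguments=[
        Argument(role="AGENT", text="The narL gene product"),
        Argument(role="THEME", text="the nitrate reductase operon",
                 concept_type="Operon"),  # hypothetical concept type label
    ],
)


def argument_spans(sentence: str, event: Event) -> List[Tuple[str, int, int]]:
    """Return (role, start, end) character spans for each argument."""
    spans = []
    for arg in event.arguments:
        start = sentence.index(arg.text)  # first occurrence in the sentence
        spans.append((arg.role, start, start + len(arg.text)))
    return spans


spans = argument_spans(sentence, event)
print(spans)
```

Because all arguments are found within the same sentence as the trigger, a single sentence string is enough to locate every span in this simple case.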
1. Copyright of abstracts
NLM data are produced by a U.S. Government agency and include works of the United States Government that are not protected by U.S. copyright law but may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. copyright law.
NLM assumes no responsibility or liability associated with use of copyrighted material, including transmitting, reproducing, redistributing, or making commercial use of the data. NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. Persons contemplating any type of transmission or reproduction of copyrighted material such as abstracts are advised to consult legal counsel.
2. Copyright of annotations
The annotations within the abstracts of the GREC corpus are the result of work carried out at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. The annotations are copyrighted and licensed by NaCTeM under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Please attribute the corpus by citing the following paper:
Thompson, P., Iqbal, S. A., McNaught, J. and Ananiadou, S. (2009). Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics 10:349
Contact
For any queries relating to the corpus, please contact:
paul.thompson at manchester.ac.uk