The GREC Corpus



  • 01/08/2011: A file in the Human part of the corpus (ID 8205615) was found to contain one character offset problem for a named entity. This problem has now been resolved, and the appropriate standoff annotation file (8205615.a1) has been replaced. The U-Compare corpus reader has also been updated accordingly.
  • 2/12/2010: A corpus reader component for the GREC corpus is now available for use in the U-Compare text mining/natural language processing system. The component should be saved and imported into U-Compare, following the steps outlined here.
  • 12/11/2010: Three files in the E. coli part of the corpus (IDs 9852003, 14996803 and 15995204) were found to contain minor errors. File 14996803 contained an event that was not centred on a verb or nominalised verb, so the event was deleted. Abstracts 9852003 and 15995204 each contained one event in which a type that should have been assigned to one of the event arguments had instead been assigned to the event itself. This has now been corrected in both files.
  • 10/12/2009: Two files in the Human part of the corpus (IDs 8205615 and 9778250) were found to have character offset problems due to foreign characters. Offsets in these files were previously based on bytes rather than characters. The corpus has now been updated so that all offsets are based on characters.
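
The difference between byte-based and character-based offsets described in the 10/12/2009 note can be illustrated with a short Python sketch. The sample sentence, the function name char_to_byte_offset and the choice of UTF-8 are assumptions made for illustration only; they are not taken from the corpus or its tools.

```python
def char_to_byte_offset(text: str, char_offset: int, encoding: str = "utf-8") -> int:
    """Return the byte offset that corresponds to a character offset."""
    return len(text[:char_offset].encode(encoding))


# Hypothetical sentence containing one non-ASCII character (Greek alpha).
text = "The \u03b1 subunit binds DNA"

# Character-based span of the word "subunit".
start = text.index("subunit")
end = start + len("subunit")
char_span = text[start:end]

# The same numbers applied to the UTF-8 bytes select a shifted span,
# because the alpha occupies two bytes but only one character.
byte_span = text.encode("utf-8")[start:end].decode("utf-8")

byte_start = char_to_byte_offset(text, start)
```

In this sketch the character offsets select the intended word, whereas the same numbers used as byte offsets select a span shifted by one position. A mismatch of this kind only appears in texts that contain non-ASCII characters.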

The corpus is available for download in 2 formats:

The annotation guidelines are also available to download.


Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events in large document collections. Effective IE systems require training data in the form of annotated corpora, in which instances of biomedical events are explicitly identified in texts. The trained IE systems can then recognise new instances of events in texts, facilitating a number of text mining applications, such as pathway maintenance and semantic searching.

The Corpus

The GREC corpus is a semantically annotated corpus of 240 MEDLINE abstracts (167 on the subject of E. coli species and 73 on the subject of the Human species) which is intended for training IE systems and/or resources which are used to extract events from biomedical literature.

The corpus has been manually annotated by biologists with events relating to gene regulation. Each event is centred on either a verb (e.g. transcribe) or a nominalised verb (e.g. transcription). Annotation consists of identifying, as exhaustively as possible, the structurally-related arguments of the verb or nominalised verb within the same sentence. Each event argument is then assigned the following information:

  • A semantic role from a fixed set of 13 roles which are tailored to the biomedical domain.
  • A biomedical concept type (where appropriate).

As a simple example, consider the following sentence:

The narL gene product activates the nitrate reductase operon

The sentence contains a single event, centred on the verb activates, with 2 arguments, i.e.:

  1. The narL gene product
  2. the nitrate reductase operon

The argument The narL gene product is assigned the semantic role AGENT and the biomedical concept type Protein, whilst the argument the nitrate reductase operon is assigned the semantic role THEME and the biomedical concept type Operon.
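
The structure of this example event can be sketched as a small Python data structure. The class and field names below (Event, Argument, trigger, role, concept, by_role) are hypothetical and chosen for illustration only; they do not reflect the file formats in which the corpus is distributed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Argument:
    text: str                      # the argument's text span
    role: str                      # semantic role, e.g. AGENT or THEME
    concept: Optional[str] = None  # biomedical concept type, where appropriate


@dataclass
class Event:
    trigger: str                   # the verb or nominalised verb
    arguments: List[Argument] = field(default_factory=list)

    def by_role(self, role: str) -> List[Argument]:
        """Return all arguments of this event that carry the given role."""
        return [arg for arg in self.arguments if arg.role == role]


# The example sentence: "The narL gene product activates the nitrate reductase operon"
event = Event(
    trigger="activates",
    arguments=[
        Argument("The narL gene product", role="AGENT", concept="Protein"),
        Argument("the nitrate reductase operon", role="THEME", concept="Operon"),
    ],
)

agents = event.by_role("AGENT")
themes = event.by_role("THEME")
```

The concept field is optional because a concept type is only assigned where appropriate, whereas every argument receives a semantic role.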

Other types of argument include:

  • LOCATION, e.g. In Escherichia coli, glnAP2 may be activated by NifA
  • MANNER, e.g. cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR
  • CONDITION, e.g. Strains carrying a mutation in the crp structural gene fail to repress ODC and ADC activities in response to increased cAMP

Full details of the annotation scheme can be found in the annotation guidelines.

GREC Licence

1. Copyright of abstracts

The abstracts contained in the GREC corpus are from PubMed(R), a database of the U.S. National Library of Medicine (NLM).

NLM data are produced by a U.S. Government agency and include works of the United States Government that are not protected by U.S. copyright law but may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. copyright law.

NLM assumes no responsibility or liability associated with use of copyrighted material, including transmitting, reproducing, redistributing, or making commercial use of the data. NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. Persons contemplating any type of transmission or reproduction of copyrighted material such as abstracts are advised to consult legal counsel.

2. Copyright of annotations

The annotations within the abstracts of the GREC corpus are the result of work carried out at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. The annotations are copyrighted and licensed by NaCTeM under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Please attribute the corpus by citing the following paper:

Thompson, P., Iqbal, S. A., McNaught, J. and Ananiadou, S. (2009). Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics 10:349


For any queries relating to the corpus, please contact:
paul.thompson at