The use of the GREC corpus is subject to NaCTeM's Terms and Conditions, and in particular Section 8, regarding the use of NLM databases.
The directory contains 3 subdirectories:
The XML annotation format for the corpus is based on the GENIA event annotation format with some minor modifications.
Two levels of annotation of the target text are expressed within each file, i.e.
An example of an annotated sentence within the XML file is shown below:
...
<sentence id="S7">
  <term sem="SPAN" id="T10" lex="The_loss">The loss</term> of TreR function led to derepression of
  <term sem="Gene" id="T11" lex="treB">treB</term> encoding
  <term sem="SPAN" id="T12" lex="an_enzymeIITre">an enzymeIITre</term> of the PTS for trehalose and of
  <term sem="Gene" id="T13" lex="treC">treC</term> encoding
  <term sem="Enzyme" id="T14" lex="TreC">TreC</term> , the cytoplasmic trehalose-6-phosphate hydrolase.
</sentence>
<event id="E6">
  <type class="GRE" />
  <Agent idref="T10" />
  <Theme idref="E7" />
  <clue>The loss of TreR function <clueType>led</clueType> to derepression of treB encoding an enzymeIITre of the PTS for trehalose and of treC encoding TreC, the cytoplasmic trehalose-6-phosphate hydrolase.</clue>
</event>
<event id="E7">
  <type class="Gene_Activation" />
  <Theme idref="T11" idref1="T13" />
  <clue>The loss of TreR function led to <clueType>derepression</clueType> of treB encoding an enzymeIITre of the PTS for trehalose and of treC encoding TreC, the cytoplasmic trehalose-6-phosphate hydrolase.</clue>
</event>
...
Each sentence of the abstract is contained within a <sentence> element. Biological concepts and other event arguments are annotated inline, indicated by <term> elements. An event argument may or may not correspond to a biological concept, and in some cases a biological concept forms only part of an event argument. The <term> elements therefore cover all annotated biological concepts, together with any other text spans that constitute event arguments.
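As a minimal sketch of how the inline <term> annotation could be read, the Python snippet below collects the terms of one sentence and recovers its plain text. The sentence used here is an abridged copy of the example above, and the helper name collect_terms is hypothetical; only the sem, id and lex attributes visible in the example are assumed.

```python
import xml.etree.ElementTree as ET

# Abridged copy of sentence S7 from the example (assumption: shortened for brevity).
SENTENCE_XML = (
    '<sentence id="S7">'
    '<term sem="SPAN" id="T10" lex="The_loss">The loss</term>'
    ' of TreR function led to derepression of '
    '<term sem="Gene" id="T11" lex="treB">treB</term>'
    ' encoding '
    '<term sem="Enzyme" id="T14" lex="TreC">TreC</term>'
    '</sentence>'
)


def collect_terms(sentence):
    """Map each term id to its sem and lex attributes and its covered text."""
    return {
        term.get("id"): {
            "sem": term.get("sem"),
            "lex": term.get("lex"),
            "text": "".join(term.itertext()),
        }
        for term in sentence.iter("term")
    }


sentence = ET.fromstring(SENTENCE_XML)
terms = collect_terms(sentence)

# itertext() walks the mixed content, so the inline tags are simply dropped.
plain_text = "".join(sentence.itertext())
```

Here terms["T10"] is a SPAN term that is an event argument but not a biological concept, whereas terms["T11"] is a Gene.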
Each <term> element has the following attributes:
Following the <sentence> element, the events in the sentence are listed, each within an <event> element. Each event has a unique id, starting with an "E". Within the <event> element, there are the following elements:
The idref attribute is always present, whilst idref1, idref2, etc. are present only if the event argument corresponds to two or more discontinuous spans of text. This is the case, for example, when an argument consists of a list of items: the annotator is required to annotate discontinuous spans consisting of the items in the list, minus any conjunctions or punctuation. In event E7 above, the THEME of derepression consists of the two spans treB and treC, which are assigned the ids T11 and T13 respectively. In the Theme element of the event, the value of the attribute idref is thus T11, whilst the value of idref1 is T13.
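The idref convention can be illustrated with a short Python sketch that gathers all spans of a discontinuous argument. It uses an abridged version of event E7 from the example; the <doc> wrapper element and the helper name argument_ids are assumptions made only so that the fragment parses on its own, and are not part of the corpus format.

```python
import re
import xml.etree.ElementTree as ET

# Abridged fragment; <doc> is a hypothetical wrapper added so it is well-formed.
FRAGMENT = """
<doc>
  <sentence id="S7">derepression of
    <term sem="Gene" id="T11" lex="treB">treB</term> and of
    <term sem="Gene" id="T13" lex="treC">treC</term>
  </sentence>
  <event id="E7">
    <type class="Gene_Activation" />
    <Theme idref="T11" idref1="T13" />
  </event>
</doc>
"""


def argument_ids(argument):
    """Return the ids referenced by idref, idref1, idref2, ... in that order."""
    def position(name):
        suffix = name[len("idref"):]
        return int(suffix) if suffix else 0  # plain idref comes first

    names = [n for n in argument.attrib if re.fullmatch(r"idref\d*", n)]
    return [argument.get(n) for n in sorted(names, key=position)]


root = ET.fromstring(FRAGMENT)
term_text = {t.get("id"): "".join(t.itertext()) for t in root.iter("term")}

event = root.find("event[@id='E7']")
theme_ids = argument_ids(event.find("Theme"))
theme_spans = [term_text[i] for i in theme_ids]
```

Note that an idref may also point to another event rather than a term (the Theme of E6 in the example refers to E7), so a fuller reader would need to look up ids beginning with "E" among the events instead.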