BioNLP'09 Shared Task on Event Extraction
in conjunction with BioNLP, a NAACL-HLT 2009 workshop, June 4-5 2009, Boulder, Colorado

Proteins and other physical entities

For the shared task, the gold annotations for named entities of the protein, gene and RNA types will be provided, in both the training and test data. The main task is to find molecular events which involve these entities as their primary participants, themes and causes. In the provided gold entity annotation, proteins, genes and RNAs are not differentiated. Strictly speaking, the gold entity annotation thus identifies named genes or gene products. Below, we will refer to these named entities simply as proteins. Note that we will restrict the protein annotation only to entities that are expected to be registered in a protein or gene database (e.g. UniProt or Entrez gene). Thus, the names of protein families, e.g. protein kinase C (PKC), or protein complexes, e.g. NF-kappa B, will not be included in the annotation, and they are considered out of scope for the shared task. Consequently, events involving only protein families or complexes are also not considered in the shared task.

Note that the provided entity annotation only covers genes and gene products. These entities are sufficient to fulfill task 1. However, for task 2, participants will need to recognize other entity mentions. For example, in order to specify the location in a localization event, participants need to recognize cellular locations.

Event Definition

The data for the shared task will be prepared based on the GENIA corpus, which has been manually annotated for bio-events. The following event types, selected on the basis of their frequency and annotation quality in the corpus, will be targeted in the shared task. For the biological interpretation of these event types, we follow the definitions given in the Gene Ontology.

(1) Gene_expression
(2) Transcription
(3) Protein_catabolism

These three event classes relate to protein production and breakdown. An event of these classes is described with a protein participant in the Theme slot.

Participants Theme
Arguments -

(4) Localization

Represents a change of the location or presence of a protein. A localization event is described with one Theme (Task (1)) participant and, when stated, either an AtLoc argument (for events specifying where a protein is located) or a ToLoc argument (for events specifying where a protein moves to) (Task (2)).

Participants Theme
Arguments ToLoc, AtLoc

(5) Binding

Includes the binding of two or more proteins (including homodimerization), and the binding of a protein and DNA. Binding events are described using Theme slots specifying the participants (Task (1)) and further specified with Site slots identifying the protein or DNA regions involved in the binding (Task (2)), when stated.

Participants Theme, Theme2, Theme3, ...
Arguments Site, Site2, Site3

Note that the numbering of the Theme and Site slots is arbitrary and only used to connect the Sites with their specific Themes. Thus, for example, a Binding event involving two proteins, A and B, with the binding region, Ra, on A stated may be annotated either with (Theme:A, Theme2:B, Site:Ra) or with (Theme:B, Theme2:A, Site2:Ra). Participants are free to assign these numbers as they see fit.
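The equivalence of the two numberings described above can be checked mechanically. The following is a minimal sketch, not part of the official task tooling: the function name `binding_pairs` is hypothetical, and it assumes that an unnumbered slot (Theme, Site) pairs with index 1.

```python
import re


def binding_pairs(arguments):
    """Pair each Theme with the Site that carries the same slot number.

    `arguments` is a list of (slot, referenced ID) tuples. The result is
    independent of how the slots were numbered.
    """
    themes, sites = {}, {}
    for slot, ref in arguments:
        match = re.fullmatch(r"(Theme|Site)(\d*)", slot)
        if match is None:
            continue  # not a Theme/Site slot
        name, number = match.groups()
        index = number or "1"  # assumption: unnumbered slot means index 1
        (themes if name == "Theme" else sites)[index] = ref
    # A Theme without a stated Site is paired with None.
    return frozenset((ref, sites.get(index)) for index, ref in themes.items())


first = binding_pairs([("Theme", "A"), ("Theme2", "B"), ("Site", "Ra")])
second = binding_pairs([("Theme", "B"), ("Theme2", "A"), ("Site2", "Ra")])
print(first == second)
```

Comparing the resulting sets rather than the raw slot names treats both annotations from the example as the same event.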

(6) Phosphorylation

A phosphorylation event is described with a protein participant in the Theme slot (Task (1)) and a site attribute in the Site slot (Task (2)).

Participants Theme
Arguments Site

(7) Regulation
(8) Positive_regulation
(9) Negative_regulation

Represents a regulatory or causal relation between the above event classes or proteins. A regulation event is described with one Theme and one Cause slot (Task (1)). This event class is further classified into Positive_regulation and Negative_regulation. Similarly to Binding events (see above), when the Theme or Cause argument is a protein and a specific domain or region of that protein is stated as being involved in the regulation, the domain is specified using the Site argument (for Theme) or the CSite argument (for Cause), as appropriate (Task (2)).

Participants Theme, Cause
Arguments Site, CSite

(see also the question "Why are no agents specified for events?" in the FAQ.)

Format

(1) The target text files

RFLAT-1: a new zinc finger transcription factor that activates RANTES gene expression in T lymphocytes.
RANTES (Regulated upon Activation, Normal T cell Expressed and Secreted) is a chemoattractant cytokine (chemokine) important in the generation of inflammatory infiltrate and human immunodeficiency virus entry into immune cells. RANTES is expressed late (3-5 days) after activation in T lymphocytes.

The target texts will be given as plain text files with ASCII characters and UNIX-style newline convention. Each target text file has two lines: one for the title, and another for the abstract. Note that sentence segmentation is not provided and events may involve entities occurring in different sentences. The target text files are named with the suffix '.txt'.

(2) The annotation files

T3 Protein 0 7 RFLAT-1
T4 Protein 63 69 RANTES
T5 Protein 104 110 RANTES
T6 Protein 112 175 Regulated upon [...] Secreted
* Equiv T5 T6
...
T7 Gene_expression 75 85 expression
E1 Gene_expression:T7 Theme:T4
T8 Positive_regulation 53 62 activates
E2 Positive_regulation:T8 Theme:E1 Cause:T3
...

Various levels of annotation of the target text are expressed using stand-off style annotations, stored separately from the target text files. In the stand-off annotation files, each annotation is specified on a separate line.

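There are four types of annotation: text-bound annotations (entities and event triggers, with IDs starting with "T"), event annotations ("E"), event modification annotations ("M", e.g. Negation and Speculation), and relation annotations (e.g. Equiv, marked with "*" in place of an ID).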

(Note that relation annotation is not included as a subtask in the shared task and is only provided as supporting information in the training data.)

Excepting relations, each annotation is given its own unique ID. The ID occurs first on each annotation line, delimited from the rest of the annotation with a TAB character. The text-bound annotations (entities and event triggers) are given as SPACE-separated triples (entity-type, offset-begin, offset-end). The offset-begin is the index of the first character in the entity, i.e. the number of characters in the document preceding it. The offset-end is the index of the first character after the entity. Thus, the character in the offset-end position is not included in the entity. In order to improve readability, the text span specified by offset-begin and offset-end is attached at the end of each annotation with a separating TAB character. Note that this is only provided for improving readability and participants do not need to produce the last column (text span) for the evaluation data.
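Because offset-end is exclusive, the offsets behave like a half-open slice into the document. The sketch below illustrates this with the sample document above, assuming the TAB and SPACE delimiters described in this section; the names `TextBound` and `parse_text_bound` are hypothetical and not part of any official tool.

```python
from typing import NamedTuple, Optional


class TextBound(NamedTuple):
    """One text-bound annotation: an entity or an event trigger."""
    ann_id: str
    ann_type: str
    begin: int           # index of the first character of the span
    end: int             # index of the first character AFTER the span
    text: Optional[str]  # readability column, may be absent


def parse_text_bound(line: str) -> TextBound:
    # Columns are TAB-separated; the middle column is a SPACE-separated triple.
    fields = line.rstrip("\n").split("\t")
    ann_type, begin, end = fields[1].split(" ")
    text = fields[2] if len(fields) > 2 else None
    return TextBound(fields[0], ann_type, int(begin), int(end), text)


# Title line and the start of the abstract line of the sample document.
document = (
    "RFLAT-1: a new zinc finger transcription factor that activates "
    "RANTES gene expression in T lymphocytes.\n"
    "RANTES (Regulated upon Activation, Normal T cell Expressed and Secreted)"
)

for line in ("T3\tProtein 0 7\tRFLAT-1",
             "T5\tProtein 104 110\tRANTES",
             "T8\tPositive_regulation 53 62\tactivates"):
    ann = parse_text_bound(line)
    # The slice [begin:end] recovers exactly the annotated span.
    print(ann.ann_id, document[ann.begin:ann.end] == ann.text)
```

Note that T5 lies in the abstract line: offsets count characters from the start of the document, so the newline after the title is included in the count.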

For the annotation of events, a frame-like format is used where each annotation is expressed as a SPACE-separated n-tuple (event-class, argument1, argument2, ...).
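A frame line of this kind can be split into its parts with a few string operations. This is a minimal sketch under the delimiters described above (a TAB after the ID, SPACEs between the items of the tuple, a colon between a slot and its value); `Event` and `parse_event` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    event_id: str
    event_class: str
    trigger: str                     # ID of the trigger annotation
    arguments: List[Tuple[str, str]]  # (slot, referenced ID) pairs


def parse_event(line: str) -> Event:
    event_id, frame = line.rstrip("\n").split("\t")
    # The first item names the event class and its trigger; the rest are slots.
    head, *rest = frame.split(" ")
    event_class, trigger = head.split(":")
    arguments = [tuple(item.split(":")) for item in rest]
    return Event(event_id, event_class, trigger, arguments)


event = parse_event("E2\tPositive_regulation:T8 Theme:E1 Cause:T3")
print(event.event_class, event.trigger, event.arguments)
```

An argument may reference either a text-bound annotation (T3) or another event (E1), so resolving references requires both kinds of annotation to be loaded first.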

Participants are required to produce the annotations for events. This thus involves producing two different types of annotation: the recognition of event trigger words from text, using the text-bound annotation format, and the association of the annotated entities and event triggers to express an event, using the frame-like format.

The relation annotation identifies equivalent entities in the training data. For equivalent entities participating in events, any one of the equivalent entities can be specified. Thus, in the above example, any reference to T5 could be replaced with a reference to T6 without changing the interpretation of the annotation.
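One way to make use of the equivalences is to map every member of an Equiv set to a single representative before comparing event arguments. The following is a simple sketch with hypothetical function names; it splits lines on whitespace and does not merge sets that share a member across several Equiv lines.

```python
def build_canonical_map(lines):
    """Map each entity ID in an Equiv relation to the first ID listed."""
    canonical = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3 or parts[0] != "*" or parts[1] != "Equiv":
            continue  # not an Equiv relation line
        members = parts[2:]
        for member in members:
            canonical[member] = members[0]
    return canonical


def canonicalize(ref, canonical):
    # IDs that take part in no Equiv relation stand for themselves.
    return canonical.get(ref, ref)


mapping = build_canonical_map(["* Equiv T5 T6"])
print(canonicalize("T6", mapping), canonicalize("T4", mapping))
```

With this mapping, an event argument referring to T6 and one referring to T5 compare as equal.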

See also the FAQ.

Examples

Example 1

(1) Target text (filename.txt)

TRADD was the only protein that interacted with wild-type TES2 and not with isoleucine-mutated TES2.

The target text is stored in a file with the suffix '.txt'.

(2) Protein annotation (filename.a1)

T1 Protein 0 5 TRADD
T2 Protein 58 62 TES2
T3 Protein 95 99 TES2

Protein annotations for the target text file are stored in a file with the same filename and the suffix '.a1'. Protein annotation files are provided to the participants together with the target text files.

(3) Event annotation corresponding to Task 1 (filename.a2.t1)

T4 Binding 32 42 interacted
E1 Binding:T4 Theme:T1 Theme2:T2
E2 Binding:T4 Theme:T1 Theme2:T3

Annotations addressed by the three tasks are stored in files named with the suffix '.a2', followed by a further suffix indicating the specific tasks covered. For example, files containing the annotations for Task 1 are given the suffix '.a2.t1'.

Note that the IDs of the protein annotations provided in the *.a1 files have to be preserved, and other annotations in the *.a2.* files must refer to them exactly as given. The IDs of annotations produced by participants can be chosen freely as long as they have the proper prefix ("T" for event triggers, "E" for events, etc.) and are unique within the *.a2.* file and its corresponding *.a1 file. In the above example, the IDs T1, T2 and T3 have to be preserved, but the IDs for the three annotations in the file filename.a2.t1 can be chosen freely, e.g. T100, E14, E1.
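The ID constraints lend themselves to a quick self-check before submission. The sketch below is only illustrative: `check_ids` is a hypothetical helper, and the set of accepted prefixes ("T", "E", "M") is inferred from the examples in this document rather than taken from an official list.

```python
def check_ids(a1_ids, a2_ids):
    """Return a list of problems found among the IDs of an .a2 file.

    `a1_ids` are the provided protein IDs, `a2_ids` the IDs produced by
    the participant, in file order.
    """
    allowed_prefixes = {"T", "E", "M"}  # assumption based on the examples
    problems = []
    seen = set(a1_ids)
    for ann_id in a2_ids:
        if ann_id[:1] not in allowed_prefixes:
            problems.append(f"{ann_id}: unexpected prefix")
        if ann_id in seen:
            # Clashes with a provided ID or with an earlier produced ID.
            problems.append(f"{ann_id}: duplicate ID")
        seen.add(ann_id)
    return problems


print(check_ids(["T1", "T2", "T3"], ["T100", "E14", "E1"]))
print(check_ids(["T1", "T2", "T3"], ["T3", "E1", "E1"]))
```

The first call reports no problems; the second flags the trigger ID that collides with a provided protein ID and the repeated event ID.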

In the above example, although the two instances of TES2 (T2 and T3) are different in their state (wild-type one and isoleucine-mutated one), we do not attempt to capture the state of entities in this shared task. However, by recognizing named entities in a text-bound way, we differentiate them and leave open the chance to investigate the surrounding context to capture their state. Note also that in Task (1), the explicitly negated statement is annotated identically to the affirmative one.

(4) Event annotation corresponding to Task 2 (filename.a2.t12)

(Same as for Task 1)

Since the text expresses no additional arguments for the events recognized in Task 1, there is nothing to add for Task 2 beyond the information extracted for Task 1. In this case, participants in Task 2, who must also perform Task 1, can submit the same files as for Task 1, but have to name them with the suffix '.a2.t12'.

(5) Event annotation corresponding to Task 3 (filename.a2.t123 or filename.a2.t13)

(Same as for Task 1 and Task 2)
M1 Negation E2

The second assertion of binding between TRADD and TES2 (E2) is negated in the text. Participants are required to capture this negation and add it as an annotation to the file named with the suffix '.a2.t13', if they are participating in Tasks 1 and 3, or '.a2.t123', if they are participating in all three tasks.
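The file suffixes used in this example follow a simple pattern that can be generated from the set of tasks a participant addresses. This is a sketch with a hypothetical function name; it assumes that Task 1 is always included, since only the suffixes '.a2.t1', '.a2.t12', '.a2.t13' and '.a2.t123' appear in this document.

```python
def submission_suffix(tasks):
    """Build the annotation file suffix for a set of task numbers."""
    chosen = sorted(set(tasks))
    # Assumption: Task 1 is mandatory and only Tasks 1 to 3 exist.
    if not chosen or chosen[0] != 1 or not set(chosen) <= {1, 2, 3}:
        raise ValueError("tasks must include 1 and be drawn from 1, 2, 3")
    return ".a2.t" + "".join(str(task) for task in chosen)


for tasks in ([1], [1, 2], [1, 3], [1, 2, 3]):
    print(tasks, "filename" + submission_suffix(tasks))
```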

Example 2

(1) Target text (filename.txt)

In this study we hypothesized that the phosphorylation of TRAF2 inhibits binding to the CD40 cytoplasmic domain. ...

(2) Protein annotation (filename.a1)

T1 Protein 58 63 TRAF2
T2 Protein 88 92 CD40

(3) Event annotation corresponding to Task 1 (filename.a2.t1)

T4 Phosphorylation 39 54 phosphorylation
E1 Phosphorylation:T4 Theme:T1
T5 Binding 73 80 binding
E2 Binding:T5 Theme:T1 Theme2:T2
T6 Negative_regulation 64 72 inhibits
E3 Negative_regulation:T6 Theme:E2 Cause:E1

(4) Event annotation corresponding to Task 2 (filename.a2.t12)

T4 Phosphorylation 39 54 phosphorylation
E1 Phosphorylation:T4 Theme:T1
T5 Binding 73 80 binding
T7 Entity 93 111 cytoplasmic domain
E2 Binding:T5 Theme:T1 Theme2:T2 Site2:T7
T6 Negative_regulation 64 72 inhibits
E3 Negative_regulation:T6 Theme:E2 Cause:E1

In addition to the core event descriptions recognized and specified in Task 1, participants in Task 2 need to recognize the region of the protein CD40 which is involved in the Binding event (E2) as expressed in the text, and specify it in the event description (the T7 line and the Site2:T7 argument in the above example). Note that participants do not need to explicitly specify the class of the named entities and the generic type "Entity" can be used. More importantly, in the case of binding events, the correspondence between the themes and sites matters because more than one protein may be involved in a binding event. In the example above, T7 is specified as the value of the slot Site2 (not Site) because the cytoplasmic domain is a part of the CD40 protein, the Theme2 (not Theme) participant.

(5) Event annotation corresponding to Task 3 (filename.a2.t13 or filename.a2.t123)

(Same as in filename.a2.t1 for participants in Tasks 1 and 3, or
 same as in filename.a2.t12 for participants in Tasks 1, 2 and 3)
M1 Speculation E3

The negative regulation (E3) of the binding event (E2) caused by the phosphorylation event (E1) is stated as speculation in the text.
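Since event arguments may themselves be events, the annotations of this example form a small nested structure. The sketch below expands it into a readable string, as a rough illustration only: the dictionaries are filled in by hand from the Task 1 annotations of Example 2, the slot names follow the event definitions, and `render` is a hypothetical helper.

```python
def render(ref, events, spans):
    """Expand an entity or event ID into a nested, readable description."""
    if ref in spans:
        return spans[ref]  # a protein: show its text span
    event_class, arguments = events[ref]
    inner = ", ".join(
        f"{slot}={render(target, events, spans)}" for slot, target in arguments
    )
    return f"{event_class}({inner})"


# Hand-copied from Example 2 (proteins from the .a1 file, events from Task 1).
spans = {"T1": "TRAF2", "T2": "CD40"}
events = {
    "E1": ("Phosphorylation", [("Theme", "T1")]),
    "E2": ("Binding", [("Theme", "T1"), ("Theme2", "T2")]),
    "E3": ("Negative_regulation", [("Theme", "E2"), ("Cause", "E1")]),
}
modifications = [("M1", "Speculation", "E3")]

for mod_id, mod_type, target in modifications:
    print(f"{mod_id} {mod_type}: {render(target, events, spans)}")
```

The modification applies to E3 as a whole, so the speculation covers the regulation and not the embedded phosphorylation or binding events taken individually.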