BioNLP'09 Shared Task on Event Extraction
in conjunction with BioNLP, a NAACL-HLT 2009 workshop, June 4-5 2009, Boulder, Colorado

Evaluation for BioNLP'09 Shared Task

The evaluation for the BioNLP'09 Shared Task is based on the equality of events as defined below. That is, each submitted event will be judged either as correct or incorrect as a whole (as opposed to e.g. measuring each event argument assignment separately). Evaluation results are reported using the standard precision/recall/f-score metrics.

The evaluation thus places an emphasis on getting entire events right, as opposed to just those arguments that can be predicted most confidently.
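
As an illustration of the reported metrics, the following minimal sketch computes precision, recall and f-score from event counts. The function name and its parameters are hypothetical, and the balanced harmonic mean (F1) is assumed here for the f-score; this is not the official evaluation software.

```python
def precision_recall_fscore(n_answer_correct, n_answer, n_gold_matched, n_gold):
    """Compute precision, recall and balanced f-score from event counts.

    n_answer_correct: submitted events judged correct as a whole
    n_answer:         all submitted events
    n_gold_matched:   gold standard events matched by some submitted event
    n_gold:           all gold standard events
    (All four names are illustrative assumptions, not part of the task definition.)
    """
    precision = n_answer_correct / n_answer if n_answer else 0.0
    recall = n_gold_matched / n_gold if n_gold else 0.0
    denom = precision + recall
    # Balanced harmonic mean of precision and recall (assumed F1).
    fscore = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fscore


if __name__ == "__main__":
    print(precision_recall_fscore(3, 4, 3, 6))
```

Because each event is counted as correct or incorrect as a whole, an event with one wrong argument contributes nothing to n_answer_correct in this sketch.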

Event equality

There are several aspects to the equality of events, including event type, the identification of the words expressing the event (event trigger expression), the event participants and arguments, and, in turn, the correctness of the entities and events that these refer to. We will apply a number of different correctness criteria, including strict equality, approximate span matching and approximate recursive matching.

Detailed definitions are given below. Note that all criteria require that the type of the event be correct and that all participants and arguments be correct. Combinations of the criteria may also be considered.

Strict equality

The strict equality criteria require that for a submitted event to match a gold standard event:

(In (3), "for each event argument" should be understood to refer to both the answer and gold, and "matching argument" to gold or answer (resp.): there can be no extra or missing arguments.)

Two entity / trigger expression spans (beg1, end1) and (beg2, end2) are the same iff beg1 = beg2 and end1 = end2.
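
The span identity test can be sketched as follows. The Span type and the function name are hypothetical; only the comparison itself comes from the definition above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A text span given by its begin and end offsets (illustrative type)."""
    beg: int
    end: int


def same_span(a: Span, b: Span) -> bool:
    # Strict equality: both boundaries must coincide exactly.
    return a.beg == b.beg and a.end == b.end


if __name__ == "__main__":
    print(same_span(Span(10, 15), Span(10, 15)))
    print(same_span(Span(10, 15), Span(10, 16)))
```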

Although strict equality serves as the basis of the evaluation criteria, considering the complexity of the problem and some of the features of the data, it may be viewed as impractically strict. We therefore also provide relaxed evaluation criteria, which are defined considering the value of the extracted information from a practical point of view.

Approximate span matching

In detail, with differences to strict criteria in bold:

For approximate matching, equivalent is defined as follows: a given span is equivalent to a gold span if it is entirely contained within an extension of the gold span by one word both to the left and to the right, that is, beg1 >= ebeg2 and end1 <= eend2, where (beg1, end1) is the given span and (ebeg2, eend2) is the extended gold span.

Thus, for example, the given span (underlined) A plays role in [...] is equivalent to the (hypothetical) gold span A plays role in [...] as it is contained in the extended span A plays role in [...].
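
A minimal sketch of this equivalence test is given below. It assumes character offsets with an exclusive end and treats a "word" as a maximal run of non-whitespace characters; neither convention is fixed by the definition above, and the function names are hypothetical.

```python
import re


def extend_by_one_word(text, beg, end):
    """Widen the gold span (beg, end) by one word on each side.

    Assumption: words are whitespace-delimited and end offsets are exclusive.
    If there is no word on a side, that boundary is left unchanged.
    """
    # Start offset of the last word that begins before the gold span.
    left_starts = [m.start() for m in re.finditer(r"\S+", text[:beg])]
    ebeg = left_starts[-1] if left_starts else beg
    # End offset of the first word that follows the gold span.
    right = re.match(r"\s*\S+", text[end:])
    eend = end + right.end() if right else end
    return ebeg, eend


def approx_equivalent(given, gold, text):
    """True iff the given span lies entirely within the extended gold span."""
    beg1, end1 = given
    ebeg2, eend2 = extend_by_one_word(text, gold[0], gold[1])
    return beg1 >= ebeg2 and end1 <= eend2


if __name__ == "__main__":
    text = "A plays role in regulation"
    gold = (2, 12)  # "plays role"
    print(extend_by_one_word(text, *gold))       # extended span covers "A plays role in"
    print(approx_equivalent((2, 15), gold, text))  # "plays role in"
    print(approx_equivalent((0, 26), gold, text))  # whole sentence
```

Note that under this reading any span contained in the extended gold span is accepted, including one shorter than the gold span.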

(Please note that we may still fine-tune this definition of approximate span equivalence to reduce the possibility of abuse.)

Approximate recursive matching

In detail, with difference to strict criteria in bold:

For partial matching, only Theme arguments are considered. Referred events are thus considered to match even if they differ in non-Theme arguments.
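
One possible reading of this relaxation is sketched below. The Event class, the representation of arguments as (role, target) pairs, and the convention that Theme roles are those whose name starts with "Theme" are all assumptions made for illustration. In this sketch the top-level event must still match on all of its arguments, and only events referred to as arguments are compared on their Theme arguments alone.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    etype: str
    trigger: tuple                             # (beg, end) of the trigger expression
    args: list = field(default_factory=list)   # [(role, target)]; target is an entity id (str) or an Event


def _target_match(a, g, partial):
    # Event-valued arguments are compared recursively as referred events.
    if isinstance(a, Event) and isinstance(g, Event):
        return events_match(a, g, partial, referred=True)
    # Otherwise both are expected to be entity ids (hypothetical representation).
    return a == g


def events_match(ans, gold, partial=False, referred=False):
    if ans.etype != gold.etype or ans.trigger != gold.trigger:
        return False
    a_args, g_args = ans.args, gold.args
    if partial and referred:
        # Partial matching: for referred events, only Theme arguments count.
        a_args = [x for x in a_args if x[0].startswith("Theme")]
        g_args = [x for x in g_args if x[0].startswith("Theme")]
    if len(a_args) != len(g_args):
        return False  # extra or missing arguments
    remaining = list(g_args)
    for role, target in a_args:
        # Pair each answer argument with one not-yet-used gold argument.
        for i, (g_role, g_target) in enumerate(remaining):
            if role == g_role and _target_match(target, g_target, partial):
                del remaining[i]
                break
        else:
            return False
    return True


if __name__ == "__main__":
    inner_gold = Event("Binding", (10, 17), [("Theme", "T1"), ("Site", "T5")])
    inner_ans = Event("Binding", (10, 17), [("Theme", "T1")])
    outer_gold = Event("Regulation", (0, 9), [("Theme", inner_gold)])
    outer_ans = Event("Regulation", (0, 9), [("Theme", inner_ans)])
    print(events_match(outer_ans, outer_gold))                # strict comparison
    print(events_match(outer_ans, outer_gold, partial=True))  # relaxed comparison
```

Here the answer omits the Site argument of the referred Binding event, so the outer Regulation event fails the strict comparison but passes the relaxed one.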

Event Decomposition

In this mode, an event with more than one argument, e.g.

event-type:trigger-id   arg1-type:arg1-id   arg2-type:arg2-id ...

is decomposed into multiple predicate-argument relations, e.g.

event-type:trigger-id   arg1-type:arg1-id
event-type:trigger-id   arg2-type:arg2-id

Each relation is then evaluated as if it were a single-argument event.
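
The decomposition can be sketched as follows. The function name and the tuple representation are hypothetical, and the identifiers in the usage example are made up for illustration.

```python
def decompose(event_type, trigger_id, args):
    """Split one multi-argument event into single-argument relations.

    args is a list of (arg_type, arg_id) pairs; each pair yields one
    relation that keeps the same event type and trigger.
    """
    head = f"{event_type}:{trigger_id}"
    return [(head, f"{arg_type}:{arg_id}") for arg_type, arg_id in args]


if __name__ == "__main__":
    for head, arg in decompose("Binding", "T3", [("Theme", "T1"), ("Theme2", "T2")]):
        print(head, arg, sep="   ")
```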