HIMERA Annotation Format

The downloadable corpus consists of:

  • a set of text files, corresponding to the textual parts of the chosen articles/excerpts.
  • a set of annotation files, containing the manually-added annotations associated with each text file.

There are two top-level directories, BMJ and MOH. The BMJ directory is further subdivided according to the decade of the articles, i.e., 1850s, 1890s, 1920s and 1960s.

Files corresponding to BMJ articles are named according to their PMCIDs.

Files corresponding to MOH report excerpts are named with the format [Borough]_[Year]_excerpt.
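
As a small illustration, the MOH naming scheme can be split into its parts as sketched below in Python. The function name parse_moh_stem and the sample stem are hypothetical. The sketch assumes that the year is a four-digit number and that any file extension has already been removed from the name.

```python
import re

# Assumed pattern: [Borough]_[Year]_excerpt, with a four-digit year.
# The borough part is matched greedily, so it may itself contain underscores.
MOH_STEM = re.compile(r"^(?P<borough>.+)_(?P<year>\d{4})_excerpt$")


def parse_moh_stem(stem):
    """Return (borough, year) for an MOH file stem, or None if it does not match."""
    match = MOH_STEM.match(stem)
    if match is None:
        return None
    return match.group("borough"), int(match.group("year"))


# Hypothetical stem, used only to show the shape of the result.
print(parse_moh_stem("SomeBorough_1925_excerpt"))
```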

Annotations are encoded in the BioNLP Shared Task 2013 format. Based on this format, there are two annotation files associated with each text file:

  • a1 files - encode information about entity annotations (and negation cues).
  • a2 files - encode information about event annotations.
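
The following Python sketch shows one way to walk the corpus and group each text file with its two annotation files. It rests on assumptions that are not stated above: that text files carry the extension .txt, that annotation files carry the extensions .a1 and .a2, and that all three files for a document share the same base name and directory. The function name iter_documents is hypothetical.

```python
from pathlib import Path


def iter_documents(corpus_root):
    """Yield (text_path, a1_path, a2_path) for each text file under corpus_root."""
    # rglob descends into BMJ/<decade>/ and MOH/ alike.
    for text_path in sorted(Path(corpus_root).rglob("*.txt")):
        # Assumed layout: annotation files sit next to the text file.
        a1_path = text_path.with_suffix(".a1")
        a2_path = text_path.with_suffix(".a2")
        # Skip text files for which either annotation file is missing.
        if a1_path.exists() and a2_path.exists():
            yield text_path, a1_path, a2_path
```

Calling list(iter_documents("path/to/corpus")) would then give one tuple per fully annotated document; adjust the extensions if the downloaded files are named differently.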

a1 Files

In a1 files, each line provides information about a single entity. Examples are shown below:

    T1     Condition 50 84     acute suffocative pulmonary oedema
    T2     Sign_or_Symptom 6612 6618;6630 6642     mitral incompetence
    T9     Negation_Cue 5609 5612	not

Each line consists of:

  • A unique id for the entity. By convention, this starts with T, followed by a numerical value.
  • A TAB character.
  • The entity type assigned to the annotation (or the label Negation_Cue, in the case that the annotated span corresponds to a word or phrase indicating the negation of an event).
  • The character-based offsets of the entity annotation in the corresponding text file. There are two formats for the offsets, depending on whether the annotation is a single, continuous span or a discontinuous span made up of multiple, connected spans. As an example of a case where a discontinuous annotation is needed, consider the text span "gouty or rheumatic bronchitis". Two conditions are mentioned in this span, i.e., "gouty bronchitis" and "rheumatic bronchitis", although the word "bronchitis" appears only once. In order to annotate "gouty bronchitis", it is necessary to annotate the words "gouty" and "bronchitis" separately and to link them together.
    • For continuous spans (as in the first example line above), there are two offsets, corresponding to the start and end offsets of the span. The first offset is separated by a space from the entity type label, and there is a space between the start and end offsets.
    • For discontinuous spans (as in the second example line above), there are two or more pairs of start and end offsets, each separated by a semi-colon. Each pair of offsets corresponds to a part of the complete annotated span.
  • Another TAB character.
  • The text covered by the annotated span in the corresponding text file.
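
The line structure described above can be parsed along the following lines. This is a minimal Python sketch, not a reference implementation: the names TextBound, parse_textbound and covered_text are hypothetical. In the example lines, the length of each annotated text equals the end offset minus the start offset (e.g., 5612 - 5609 = 3 for "not"), so the sketch treats end offsets as exclusive. It also assumes that the parts of a discontinuous span are joined with a single space in the final field, as in "mitral incompetence".

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str                    # e.g. "T2"
    label: str                     # entity type or Negation_Cue
    spans: List[Tuple[int, int]]   # one (start, end) pair per span part
    text: str                      # text covered by the annotation


def parse_textbound(line):
    """Parse one line of the form: ID <TAB> LABEL OFFSETS <TAB> TEXT."""
    ann_id, middle, text = line.rstrip("\n").split("\t", 2)
    # The label is separated from the offsets by whitespace.
    label, offsets = middle.split(None, 1)
    spans = []
    # Discontinuous spans list several "start end" pairs separated by ";".
    for pair in offsets.split(";"):
        start, end = pair.split()
        spans.append((int(start), int(end)))
    return TextBound(ann_id, label, spans, text)


def covered_text(document, spans):
    """Rebuild the annotated text from the document (assumes end-exclusive offsets)."""
    return " ".join(document[start:end] for start, end in spans)


entity = parse_textbound("T2\tSign_or_Symptom 6612 6618;6630 6642\tmitral incompetence")
print(entity.label, entity.spans)
```

Comparing covered_text(document, entity.spans) with entity.text for every line is a cheap sanity check that the text file and the a1 file are read with the same character encoding.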

a2 Files

There are three different formats of lines in a2 files, as shown in the example below:

    T10	    Affect 3347 3358	gave relief
    E11	    Affect:T10 Cause:T63 Subj:T61 Cue:T43
    E8	    Causality:T11 Result:T240 Result2:T254 Cause:T241
    M14	    Negation E11

The formats of the lines are as follows:
  • Lines starting with T - These correspond to event trigger spans and have exactly the same format as the lines in the a1 files, except that the semantic labels correspond to the relevant event type (i.e., either Causality or Affect). As with entity annotations, the spans may be discontinuous.
  • Lines starting with E - These correspond to event annotations. They consist of the following parts:
    • A unique id for the event. By convention, this starts with an E, followed by a numerical value.
    • A TAB character.
    • The semantic type assigned to the event, followed by a colon and the ID assigned to the event trigger span.
    • A sequence of pairs of the format [Label]:[ID], separated by spaces. Each pair corresponds either to an event participant, in which case the [Label] part of the pair is the semantic role assigned to the participant, or to a negation cue, in which case the [Label] part has the value Cue. The [ID] part may start with a T, in which case it corresponds to an entity annotation in the associated a1 file, or it may start with an E, in which case the participant corresponds to another event listed within the same a2 file. If more than one participant is assigned the same semantic role, then for the second and subsequent participants, a number is appended to the semantic role label, e.g., Result2 for the second participant assigned the Result role, as in the third example line above.
  • Lines starting with M - These correspond to attributes or modifications assigned to events. In the case of HIMERA, the only such modification possible is for an event to be negated. These lines consist of the following parts:
    • A unique id for the negation. By convention, this starts with an M, followed by a numerical value.
    • A TAB character.
    • The label Negation, followed by a space and then the ID of the event that is negated.
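
The E and M line formats can be read with a short Python sketch such as the one below. The function names parse_event, parse_modification and parse_a2 are hypothetical. The sketch assumes that no semantic role name itself ends in a digit, so a trailing number can be stripped to recover the base role (Result2 becomes Result). Lines starting with T are skipped here, since they share the a1 line format.

```python
import re


def parse_event(line):
    """Parse an E line into (event_id, event_type, trigger_id, arguments)."""
    event_id, body = line.rstrip("\n").split("\t", 1)
    head, *pairs = body.split()
    event_type, trigger_id = head.split(":", 1)
    arguments = []
    for pair in pairs:
        label, ref_id = pair.split(":", 1)
        # Assumption: a trailing number only marks a repeated role, e.g. Result2.
        role = re.sub(r"\d+$", "", label)
        arguments.append((role, ref_id))
    return event_id, event_type, trigger_id, arguments


def parse_modification(line):
    """Parse an M line into (mod_id, label, event_id)."""
    mod_id, body = line.rstrip("\n").split("\t", 1)
    label, event_id = body.split()
    return mod_id, label, event_id


def parse_a2(lines):
    """Return (events keyed by ID, set of negated event IDs)."""
    events, negated = {}, set()
    for line in lines:
        if line.startswith("E"):
            event_id, event_type, trigger_id, arguments = parse_event(line)
            events[event_id] = (event_type, trigger_id, arguments)
        elif line.startswith("M"):
            _, label, event_id = parse_modification(line)
            if label == "Negation":
                negated.add(event_id)
    return events, negated


sample = [
    "E11\tAffect:T10 Cause:T63 Subj:T61 Cue:T43",
    "E8\tCausality:T11 Result:T240 Result2:T254 Cause:T241",
    "M14\tNegation E11",
]
events, negated = parse_a2(sample)
print(events["E8"], negated)
```

Keeping arguments as a list of (role, ID) pairs rather than a dictionary preserves repeated roles such as the two Result participants of E8. IDs starting with E can be looked up in the returned events dictionary, while IDs starting with T refer to the a1 file or to the trigger lines of the same a2 file.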