NaCTeM

ACE Meta-knowledge annotation format

This page provides details of the following:

  • The custom XML format that has been developed to encode the meta-knowledge information annotated for each event in the English part of the ACE 2005 corpus.
  • The proposed combined XML format that integrates information from the original English ACE 2005 annotation effort with the newly annotated meta-knowledge information.
  • How to run the Java program to automatically merge the new meta-knowledge annotation information with the original annotations, according to the proposed, integrated XML format.

Custom XML format to encode meta-knowledge information about events in ACE 2005

The format is used to encode meta-knowledge information annotated for each event in the ACE 2005 corpus. It is formally defined in the DTD file add.dtd, which is included as part of the download.

The meta-knowledge annotations are provided in the ACE-MK directory, which can be downloaded. This directory itself has six subdirectories, which correspond to the same split of the documents as is provided in the ACE 2005 English corpus data. Each of the six subdirectories, i.e. bn, bc, nw, cts, wl, un, corresponds to a different data source from which the documents were originally drawn.

The meta-knowledge annotation files within each of the subdirectories end with the extension .add.xml. The base name of each file matches that of the underlying .sgm file, which corresponds to the original document in the ACE 2005 corpus, and that of the original ACE 2005 annotation file (with the extension .apf.xml), which the meta-knowledge annotation file augments. These original annotation files are the ones contained within the timex2norm subdirectory of each data source directory in the original ACE 2005 distribution (e.g., bn/timex2norm), since these are considered to be the final, consolidated annotation files.

An example of the custom XML format used to encode meta-knowledge information is shown below, after which descriptions of the different types of elements are provided.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE source_file SYSTEM "add.dtd">
<source_file AUTHOR="LDC" ENCODING="UTF-8" SOURCE="broadcast news" TYPE="text"
    URI="CNN_ENG_20030312_083725.3.sgm">
  <document DOCID="CNN_ENG_20030312_083725.3">
    <mk-cue ID="CNN_ENG_20030312_083725.3-C1" TYPE="SourceType-Cue">
      <extent>
        <charseq END="525" START="522">said</charseq>
      </extent>
    </mk-cue>
    <mk-source ID="CNN_ENG_20030312_083725.3-C2">
      <extent>
        <charseq END="520" START="506">earlier reports</charseq>
      </extent>
    </mk-source>
    <mk-cue ID="CNN_ENG_20030312_083725.3-C4" TYPE="Modality-Cue">
      <extent>
        <charseq END="247" START="238">apparently</charseq>
      </extent>
    </mk-cue>
    <event_mention ID="CNN_ENG_20030312_083725.3-EV3-1" MK-GENERICITY="Specific"
                      MK-MODALITY="Speculated" MK-POLARITY="Positive"
                      MK-SOURCE-TYPE="Author" MK-SUBJECTIVITY="Neutral"
                      MK-TENSE="Past">
      <event_mention_mk_evidence EVIDENCE-TYPE="MODALITY-CUE"
                                    REFID="CNN_ENG_20030312_083725.3-C4">
        <extent>
          <charseq END="247" START="238">apparently</charseq>
        </extent>
      </event_mention_mk_evidence>
    </event_mention>
    <event_mention ID="CNN_ENG_20030312_083725.3-EV8-1" MK-GENERICITY="Specific"
                      MK-MODALITY="Asserted" MK-POLARITY="Positive"
                      MK-SOURCE-TYPE="ThirdParty" MK-SUBJECTIVITY="Neutral"
                      MK-TENSE="Past">
      <event_mention_mk_evidence EVIDENCE-TYPE="SOURCE-UNNAMED"
                                    REFID="CNN_ENG_20030312_083725.3-C2">
        <extent>
          <charseq END="520" START="506">earlier reports</charseq>
        </extent>
      </event_mention_mk_evidence>
      <event_mention_mk_evidence EVIDENCE-TYPE="SOURCETYPE-CUE"
                                    REFID="CNN_ENG_20030312_083725.3-C1">
        <extent>
          <charseq END="525" START="522">said</charseq>
        </extent>
      </event_mention_mk_evidence>
    </event_mention>
  </document>
</source_file>

The source_file and document elements match those at the start of the original corresponding ACE annotation file (with the extension .apf.xml), and they provide information about the underlying text document. The information added during the meta-knowledge annotation effort is encoded in the following elements:

mk-cue

These elements encode information about meta-knowledge cue words and phrases (i.e., evidence for the assignment of particular meta-knowledge values) that have been identified during annotation.

Attributes

  • ID - A unique ID for the cue
  • TYPE - The type of the cue, i.e., the type of meta-knowledge attribute for which it provides evidence. The value of this attribute is one of the following: Subjectivity-Cue, Modality-Cue, Tense-Cue, Genericity-Cue, SourceType-Cue

Children

  • A single extent element always appears as a child, which is used to denote the span covered in the corresponding text file.

mk-source

These elements encode phrases in the text that denote information sources of events and that were identified during the meta-knowledge annotation effort. Note, however, that certain sources correspond to entities that were annotated during the original ACE annotation effort (e.g., people or organisations providing information). The mk-source elements only encode sources that do not correspond to originally annotated entities. Often, these newly annotated sources are vague phrases, such as "reports", that refer to unnamed sources.

Attributes

  • ID - A unique ID for the source phrase

Children

  • A single extent element always appears as a child, which is used to denote the span covered in the corresponding text file.

event_mention

The original ACE 2005 annotation includes both event and event_mention annotations. A particular event (e.g., a death or attack) may be mentioned several times in a document. The event elements group together these different mentions of the same event (i.e., an event element can have one or more event_mention elements as its children). In the original ACE 2005 annotation, meta-knowledge attributes (i.e., MODALITY, POLARITY, TENSE and GENERICITY) were encoded at the level of event elements, i.e. the values of these attributes were expected to be the same for all mentions of the events.

However, for our augmented meta-knowledge annotation, the meta-knowledge information is more appropriately attached at the level of event mentions. This is because, for example, different sources could be providing different information about the same event. Each of these sources may in turn have different opinions about the event. Such factors mean that, when applying our augmented and more fine-grained meta-knowledge annotation scheme, it is far more appropriate to assign meta-knowledge information to each individual mention of an event.

Each event_mention element in the .add.xml files corresponds to an event_mention element in the associated .apf.xml file within the ACE 2005 corpus. The information provided within each event_mention element in the .add.xml file is intended to augment the corresponding event_mention element in the .apf.xml file.

Attributes

  • ID - The ID for the event mention. This ID corresponds to the ID of an event_mention element in the associated .apf.xml file within the ACE 2005 corpus (i.e., within the timex2norm subdirectory of the corresponding corpus partition).
  • MK-GENERICITY - The meta-knowledge GENERICITY attribute associated with the event mention. Possible values are Specific and Generic
  • MK-MODALITY - The meta-knowledge MODALITY attribute associated with the event mention. Possible values are Asserted, Presupposed, Speculated and Other
  • MK-POLARITY - The meta-knowledge POLARITY attribute associated with the event mention. Possible values are Positive and Negative
  • MK-SOURCE-TYPE - The meta-knowledge SOURCE-TYPE attribute associated with the event mention. Possible values are Author, Involved and ThirdParty
  • MK-SUBJECTIVITY - The meta-knowledge SUBJECTIVITY attribute associated with the event mention. Possible values are Positive, Negative, Multi-valued and Neutral
  • MK-TENSE - The meta-knowledge TENSE attribute associated with the event mention. The examples on this page use the values Past and Future; see add.dtd for the formal definition.

Children

  • Zero or more event_mention_mk_evidence elements, each of which links a meta-knowledge related text span to the event mention.
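As a rough illustration of how the event_mention attributes described above might be read, the following Java sketch collects the MK-* attributes of every event mention in the content of an .add.xml file. The class MkAttributeReader and its methods are hypothetical names introduced here and are not part of the download. The sketch assumes that DTD validation is not needed, so the reference to add.dtd is answered with an empty document instead of being loaded from disk.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper (not part of the download): reads MK-* attributes per event mention. */
final class MkAttributeReader {

    /** Parses XML content without validating it and without loading add.dtd. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption: validation is not required, so any DTD request gets an empty answer.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Maps each event_mention ID to its MK-* attributes (attribute name to value). */
    static Map<String, Map<String, String>> readMkAttributes(Document doc) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        NodeList mentions = doc.getElementsByTagName("event_mention");
        for (int i = 0; i < mentions.getLength(); i++) {
            Element mention = (Element) mentions.item(i);
            Map<String, String> mk = new LinkedHashMap<>();
            NamedNodeMap attrs = mention.getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                String name = attrs.item(j).getNodeName();
                if (name.startsWith("MK-")) {
                    mk.put(name, attrs.item(j).getNodeValue());
                }
            }
            result.put(mention.getAttribute("ID"), mk);
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny inline document standing in for the content of an .add.xml file.
        String xml = "<source_file><document DOCID=\"DOC\">"
                + "<event_mention ID=\"DOC-EV3-1\" MK-MODALITY=\"Speculated\" MK-POLARITY=\"Positive\"/>"
                + "</document></source_file>";
        readMkAttributes(parse(xml)).forEach((id, mk) -> System.out.println(id + " " + mk));
    }
}
```

The sketch filters on the MK- prefix rather than on a fixed attribute list, so it does not need to know the full set of meta-knowledge dimensions in advance.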

event_mention_mk_evidence

These elements indicate the meta-knowledge related text spans that are associated with their parent event mention elements. Such text spans correspond either to cue words/phrases that provide evidence for the assignment of a particular meta-knowledge attribute value, or to phrases denoting the information source of an event.

Attributes

  • EVIDENCE-TYPE - The type of meta-knowledge evidence that the underlying text span provides. This can either be:
    • A cue word/phrase, in which case one of the following values is used: MODALITY-CUE, GENERICITY-CUE, POLARITY-CUE, SUBJECTIVITY-CUE, TENSE-CUE, SOURCETYPE-CUE
    • A word/phrase denoting an information source, in which case one of the following values is used: SOURCE-NAMED, if the phrase corresponds to a named source (e.g., a particular person, group or organisation); or SOURCE-UNNAMED, if the phrase is a vaguer reference to an information source that is not specifically named.
    NOTE: There can be more than one instance of a particular EVIDENCE-TYPE associated with a given event mention. For example, there may be multiple sources, or multiple words/phrases that provide evidence about the modality of an event.
  • REFID - An ID that references an existing text span annotation (i.e., the value of this attribute matches the value of the ID attribute of an existing element corresponding to a text span annotation). The referenced text span annotation may be one of the following:
    • An mk-cue annotation in the same file
    • An mk-source annotation in the same file
    • An entity_mention annotation in the associated original annotation file (i.e., the file with the extension .apf.xml within the timex2norm subdirectory of the corresponding corpus partition). As mentioned above, information sources of events may correspond to entities that were previously identified as part of the original ACE 2005 annotation effort (people, organisations, etc.). In this case, the REFID value will correspond to the annotation that identifies the mention of the relevant entity within the same sentence.

Children

  • A single extent element always appears as a child, which is used to denote the span covered in the corresponding text file.
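Because a REFID value may point either into the same .add.xml file or into the associated .apf.xml file, resolving it requires looking in both. The following Java sketch shows one possible way to do this. EvidenceResolver and its Target enum are hypothetical names introduced for illustration, and the sketch assumes that both files have already been read into strings and that IDs are unique across the two files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper (not part of the download): finds what each REFID points to. */
final class EvidenceResolver {

    /** The kind of annotation a REFID was found to reference. */
    enum Target { MK_CUE, MK_SOURCE, ENTITY_MENTION, UNRESOLVED }

    /** Parses XML content without validating it and without loading any DTD. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Records the ID of every element with the given tag name under the given kind. */
    private static void index(Document doc, String tag, Target kind, Map<String, Target> ids) {
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.put(((Element) nodes.item(i)).getAttribute("ID"), kind);
        }
    }

    /** Indexes mk-cue and mk-source IDs from the .add.xml file and entity_mention IDs from the .apf.xml file. */
    static Map<String, Target> buildIndex(Document addDoc, Document apfDoc) {
        Map<String, Target> ids = new HashMap<>();
        index(apfDoc, "entity_mention", Target.ENTITY_MENTION, ids);
        index(addDoc, "mk-cue", Target.MK_CUE, ids);
        index(addDoc, "mk-source", Target.MK_SOURCE, ids);
        return ids;
    }

    /** Maps the REFID of every event_mention_mk_evidence element to the kind of annotation it references. */
    static Map<String, Target> resolveEvidence(Document addDoc, Map<String, Target> ids) {
        Map<String, Target> resolved = new LinkedHashMap<>();
        NodeList evidence = addDoc.getElementsByTagName("event_mention_mk_evidence");
        for (int i = 0; i < evidence.getLength(); i++) {
            String refId = ((Element) evidence.item(i)).getAttribute("REFID");
            resolved.put(refId, ids.getOrDefault(refId, Target.UNRESOLVED));
        }
        return resolved;
    }

    public static void main(String[] args) {
        Document add = parse("<source_file><document DOCID=\"DOC\">"
                + "<mk-cue ID=\"DOC-C1\" TYPE=\"Modality-Cue\"/>"
                + "<event_mention ID=\"DOC-EV1-1\">"
                + "<event_mention_mk_evidence EVIDENCE-TYPE=\"MODALITY-CUE\" REFID=\"DOC-C1\"/>"
                + "<event_mention_mk_evidence EVIDENCE-TYPE=\"SOURCE-NAMED\" REFID=\"DOC-E3-141\"/>"
                + "</event_mention></document></source_file>");
        Document apf = parse("<source_file><document DOCID=\"DOC\">"
                + "<entity ID=\"DOC-E3\"><entity_mention ID=\"DOC-E3-141\"/></entity>"
                + "</document></source_file>");
        System.out.println(resolveEvidence(add, buildIndex(add, apf)));
    }
}
```

A REFID that resolves to UNRESOLVED would suggest that the wrong .apf.xml file has been paired with the .add.xml file.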

extent

An element used to denote the span covered in the corresponding text file.

Children

  • A single charseq element, which provides details of the character offsets in the underlying document, and the text span covered.

charseq

An element providing precise details about the text span covered by an annotation. The content of this element is the exact text covered by the annotation.

Attributes

  • START - The character offset in the underlying document file corresponding to the first character covered by the annotation.
  • END - The character offset in the underlying document file corresponding to the last character covered by the annotation.
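Since END refers to the last character covered (rather than the position after it), a span covers END - START + 1 characters. For instance, the cue "said" in the example above has START="522" and END="525", i.e., four characters. The following Java sketch illustrates the arithmetic. CharseqSpan is a hypothetical class introduced here, and the sketch assumes that the caller supplies text in which the offsets are already valid, since how the offsets relate to any markup in the .sgm file is not covered on this page.

```java
/** Hypothetical helper (not part of the download): a charseq span with inclusive offsets. */
final class CharseqSpan {
    private final int start; // offset of the first character covered
    private final int end;   // offset of the last character covered (inclusive)

    CharseqSpan(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("Invalid span: " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
    }

    /** Number of characters covered; END is inclusive, hence the +1. */
    int length() {
        return end - start + 1;
    }

    /** Returns the covered text; substring takes an exclusive end index, hence end + 1. */
    String extract(String text) {
        return text.substring(start, end + 1);
    }

    /** Checks that the span lies inside the text and covers exactly the expected string. */
    boolean matches(String text, String expected) {
        return end < text.length() && extract(text).equals(expected);
    }

    public static void main(String[] args) {
        // Offsets taken from the example above: <charseq END="525" START="522">said</charseq>
        CharseqSpan said = new CharseqSpan(522, 525);
        System.out.println("length of span = " + said.length());

        // A short stand-in text; real offsets refer to the underlying document file.
        String text = "earlier reports said";
        System.out.println(new CharseqSpan(16, 19).extract(text));
    }
}
```

Comparing the extracted text with the content of the charseq element, as matches does, is a cheap consistency check when loading annotations.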

Integrated XML format combining original ACE 2005 event annotations with detailed meta-knowledge information

An example of part of a file using the proposed integrated format is shown below. The changes are formally defined in a DTD file that extends the original DTD used to encode the ACE 2005 annotation.

<source_file AUTHOR="LDC" ENCODING="UTF-8" SOURCE="broadcast conversation" TYPE="text"
                URI="CNN_CF_20030303.1900.00.sgm">
<document DOCID="CNN_CF_20030303.1900.00">
<entity CLASS="SPC" ID="CNN_CF_20030303.1900.00-E75" SUBTYPE="Nation" TYPE="GPE">
  <entity_mention ID="CNN_CF_20030303.1900.00-E75-142" LDCTYPE="NAM" ROLE="LOC" TYPE="NAM">
    <extent>
      <charseq END="3835" START="3832">Iraq</charseq>
    </extent>
    <head>
      <charseq END="3835" START="3832">Iraq</charseq>
    </head>
  </entity_mention>
  <entity_attributes>
    <name NAME="Iraq">
      <charseq END="3835" START="3832">Iraq</charseq>
    </name>
  </entity_attributes>
</entity>
<mk-cue ID="CNN_CF_20030303.1900.00-C1" TYPE="Modality-Cue">
  <extent>
    <charseq END="3814" START="3813">if</charseq>
  </extent>
</mk-cue>
<event GENERICITY="Generic" ID="CNN_CF_20030303.1900.00-EV8" MODALITY="Other" POLARITY="Positive"
           SUBTYPE="Attack" TENSE="Future" TYPE="Conflict">
  <event_argument REFID="CNN_CF_20030303.1900.00-E3" ROLE="Attacker"/>
  <event_argument REFID="CNN_CF_20030303.1900.00-E75" ROLE="Place"/>
  <event_mention ID="CNN_CF_20030303.1900.00-EV8-1" MK-GENERICITY="Specific"
                    MK-MODALITY="Speculated" MK-POLARITY="Positive" MK-SOURCE-TYPE="Author"
                    MK-SUBJECTIVITY="Neutral" MK-TENSE="Future">
    <extent>
      <charseq END="3835" START="3816">we go to war in Iraq</charseq>
    </extent>
    <ldc_scope>
      <charseq END="3835" START="3771">It could swell to as
much as $500 billion if we go to war in Iraq</charseq>
    </ldc_scope>
    <anchor>
      <charseq END="3827" START="3825">war</charseq>
    </anchor>
    <event_mention_argument REFID="CNN_CF_20030303.1900.00-E3-141" ROLE="Attacker">
      <extent>
        <charseq END="3817" START="3816">we</charseq>
      </extent>
    </event_mention_argument>
    <event_mention_argument REFID="CNN_CF_20030303.1900.00-E75-142" ROLE="Place">
      <extent>
        <charseq END="3835" START="3832">Iraq</charseq>
      </extent>
    </event_mention_argument>
    <event_mention_mk_evidence EVIDENCE-TYPE="MODALITY-CUE" REFID="CNN_CF_20030303.1900.00-C1">
      <extent>
        <charseq END="3814" START="3813">if</charseq>
      </extent>
    </event_mention_mk_evidence>
  </event_mention>
</event>

The main points to note about this format are the following:

  • mk-cue and mk-source elements are added as children of the document element, i.e., they are siblings of existing entity, timex2, relation and event elements.
  • The attributes MK-GENERICITY, MK-MODALITY, MK-POLARITY, MK-SOURCE-TYPE, MK-SUBJECTIVITY and MK-TENSE are added to each event_mention element, to provide detailed meta-knowledge information about each individual mention of an event.
  • Zero or more event_mention_mk_evidence elements may occur as children of an event_mention element (and thus as siblings of any event_mention_argument elements), to encode meta-knowledge related text spans that are specifically associated with the event mention.
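The three points above can be sketched as a small DOM transformation. The following Java code is only an illustration of the integrated format and is not the CombineMKWithACE program itself. MkMergeSketch is a hypothetical name, and the sketch assumes that both files are already available as strings and that appending the new elements at the end of their parent is acceptable (the element order required by the extended DTD is not checked).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical sketch (not the distributed program): merges .add.xml content into .apf.xml content. */
final class MkMergeSketch {

    /** Parses XML content without validating it and without loading any DTD. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Enriches apfDoc in place with the meta-knowledge information found in addDoc. */
    static void merge(Document apfDoc, Document addDoc) {
        Element apfDocument = (Element) apfDoc.getElementsByTagName("document").item(0);

        // Point 1: mk-cue and mk-source elements become children of the document element.
        for (String tag : new String[] {"mk-cue", "mk-source"}) {
            NodeList spans = addDoc.getElementsByTagName(tag);
            for (int i = 0; i < spans.getLength(); i++) {
                apfDocument.appendChild(apfDoc.importNode(spans.item(i), true));
            }
        }

        // Index the original event mentions by ID before modifying them.
        Map<String, Element> apfMentions = new HashMap<>();
        NodeList originals = apfDoc.getElementsByTagName("event_mention");
        for (int i = 0; i < originals.getLength(); i++) {
            Element mention = (Element) originals.item(i);
            apfMentions.put(mention.getAttribute("ID"), mention);
        }

        NodeList mkMentions = addDoc.getElementsByTagName("event_mention");
        for (int i = 0; i < mkMentions.getLength(); i++) {
            Element mk = (Element) mkMentions.item(i);
            Element target = apfMentions.get(mk.getAttribute("ID"));
            if (target == null) {
                continue; // no event mention with this ID in the original file
            }
            // Point 2: copy the MK-* attributes onto the original event_mention element.
            NamedNodeMap attrs = mk.getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                String name = attrs.item(j).getNodeName();
                if (name.startsWith("MK-")) {
                    target.setAttribute(name, attrs.item(j).getNodeValue());
                }
            }
            // Point 3: append the evidence elements after the existing children.
            for (Node child = mk.getFirstChild(); child != null; child = child.getNextSibling()) {
                if ("event_mention_mk_evidence".equals(child.getNodeName())) {
                    target.appendChild(apfDoc.importNode(child, true));
                }
            }
        }
    }

    public static void main(String[] args) {
        Document apf = parse("<source_file><document DOCID=\"DOC\"><event ID=\"DOC-EV8\">"
                + "<event_mention ID=\"DOC-EV8-1\"/></event></document></source_file>");
        Document add = parse("<source_file><document DOCID=\"DOC\">"
                + "<event_mention ID=\"DOC-EV8-1\" MK-MODALITY=\"Speculated\"/>"
                + "</document></source_file>");
        merge(apf, add);
        Element merged = (Element) apf.getElementsByTagName("event_mention").item(0);
        System.out.println(merged.getAttribute("MK-MODALITY"));
    }
}
```

Note that the original event-level attributes (MODALITY, POLARITY, TENSE and GENERICITY) are left untouched, which matches the example above, where the event keeps MODALITY="Other" while its mention carries MK-MODALITY="Speculated".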

Java program to combine meta-knowledge XML files with original ACE 2005 English annotation files

We provide a Java program (CombineMKWithACE.java) that will automatically combine information in the meta-knowledge XML annotation files with the annotation in the original ACE 2005 English XML annotation files.

Instructions

  1. Compile the program:
    javac CombineMKWithACE.java
  2. Run the program, which takes three arguments:
    java CombineMKWithACE PATH_TO_ENGLISH_ACE_DATA PATH_TO_MK_DATA PATH_TO_OUTPUT_DIRECTORY

    The three arguments are as follows:
    • PATH_TO_ENGLISH_ACE_DATA - This is the path to the directory containing the English data in the ACE 2005 distribution. For example, /Users/paul/ace_2005_td_v7/data/English.
      NOTE: The structure of the English directory is expected to be exactly in the format provided by the LDC. Specifically, it is expected that the subdirectories bn, bc, nw, cts, wl and un will exist within the English directory, and that each of these subdirectories will contain a timex2norm subdirectory.
    • PATH_TO_MK_DATA - This is the path to the downloaded and unpacked directory (named ACE-MK) containing the meta-knowledge XML files. For example, /Users/paul/ACE-MK.
    • PATH_TO_OUTPUT_DIRECTORY - This is the path to the directory in which the program will store the integrated XML files. For example, /Users/paul/COMBINED-ACE-MK. NOTE: The output directory may already exist; if it does not, the program will create it.
As a result of running the program, sub-directories corresponding to each part of the corpus (i.e., bn, bc, nw, cts, wl and un) will be created, if they do not already exist. Within each of these sub-directories, an XML file with the extension .apf.mk.xml will be created for each of the original XML annotation files, which combines the original annotation information with the meta-knowledge annotation information.
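The relationship between the three directories can be summarised in code. The following Java sketch derives, for a given .add.xml file, the path of the corresponding original annotation file and the path of the combined output file. MkPaths is a hypothetical class that is not part of the download, and the sketch assumes that the .add.xml files sit directly inside the bn, bc, nw, cts, wl and un subdirectories of ACE-MK.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of the download): maps between the three directory layouts. */
final class MkPaths {
    private static final String ADD_SUFFIX = ".add.xml";

    /** Base name shared by the .sgm, .apf.xml and .add.xml files of one document. */
    static String baseName(Path addFile) {
        String name = addFile.getFileName().toString();
        if (!name.endsWith(ADD_SUFFIX)) {
            throw new IllegalArgumentException("Not a meta-knowledge file: " + name);
        }
        return name.substring(0, name.length() - ADD_SUFFIX.length());
    }

    /** Data source directory name (e.g. bn); assumes the file sits directly inside it. */
    static String sourceOf(Path addFile) {
        return addFile.getParent().getFileName().toString();
    }

    /** Original annotation file: ENGLISH/source/timex2norm/base.apf.xml */
    static Path originalAnnotation(Path aceEnglishDir, String source, String baseName) {
        return aceEnglishDir.resolve(source).resolve("timex2norm").resolve(baseName + ".apf.xml");
    }

    /** Combined output file: OUTPUT/source/base.apf.mk.xml */
    static Path combinedOutput(Path outputDir, String source, String baseName) {
        return outputDir.resolve(source).resolve(baseName + ".apf.mk.xml");
    }

    public static void main(String[] args) {
        // Example paths only; they mirror the examples given for the three arguments.
        Path add = Paths.get("/Users/paul/ACE-MK", "bn", "CNN_ENG_20030312_083725.3.add.xml");
        String base = baseName(add);
        String source = sourceOf(add);
        System.out.println(originalAnnotation(Paths.get("/Users/paul/ace_2005_td_v7/data/English"), source, base));
        System.out.println(combinedOutput(Paths.get("/Users/paul/COMBINED-ACE-MK"), source, base));
    }
}
```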