Quick Reference on the Format of GENIA Corpus

contact: genia@is.s.u-tokyo.ac.jp


The top-level element of the GENIA corpus is a set element containing article elements.

<?xml version="1.0"?>
<!DOCTYPE set
    SYSTEM "gpml.dtd">
<set>
<article> ... </article>
<article> ... </article>
...
</set>


Each article in the GENIA corpus contains its MEDLINE ID, title, and abstract, in that order.

<article>
<articleinfo>
<bibliomisc>MEDLINE:82055447</bibliomisc>
</articleinfo>
<title>...</title>
<abstract>
... </abstract>
</article>
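
As an illustration of the structure above, the following Python sketch reads the MEDLINE ID, title and abstract of each article with the standard xml.etree.ElementTree module. The sample string and the function name read_articles are assumptions made for this example and are not part of the corpus distribution. The sketch also assumes that the ID is written as a prefix, a colon and a number.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus that follows the structure described above.
SAMPLE = """<set>
<article>
<articleinfo>
<bibliomisc>MEDLINE:82055447</bibliomisc>
</articleinfo>
<title><sentence>A sample title.</sentence></title>
<abstract><sentence>First sentence.</sentence>
<sentence>Second sentence.</sentence></abstract>
</article>
</set>"""


def read_articles(xml_text):
    """Yield (medline_id, title_text, abstract_text) for each article."""
    root = ET.fromstring(xml_text)
    for article in root.iter("article"):
        raw_id = article.findtext("articleinfo/bibliomisc", default="").strip()
        # Keep only the part after the first colon (assumed "PREFIX:number").
        medline_id = raw_id.split(":", 1)[-1]
        # itertext() flattens any nested markup into plain text.
        title = "".join(article.find("title").itertext()).strip()
        abstract = "".join(article.find("abstract").itertext()).strip()
        yield medline_id, title, abstract


articles = list(read_articles(SAMPLE))
```

For a corpus stored in a file, ET.parse(path).getroot() could be used in place of ET.fromstring.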


All text in the titles and abstracts is segmented into sentences.

<abstract>
<sentence>We have developed a whole cell binding assay with [3H]dexamethasone as the ligand for the measurement of the glucocorticoid receptor (GR) content of normal and malignant human leukocytes.</sentence>
...
<sentence>No obvious correlation was found between the GR content and the phenotype of the cell line nor between the GR content and the in vitro growth inhibition by glucocorticoids.</sentence>
</abstract>
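
Because every sentence is its own element, a list of sentences can be collected directly. The sketch below is one possible way to do this in Python. The sentences in the sample are shortened, and the helper name sentences is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical abstract with two sentence elements.
ABSTRACT = """<abstract>
<sentence>We have developed a whole cell binding assay.</sentence>
<sentence>No obvious correlation was found.</sentence>
</abstract>"""


def sentences(element):
    """Return the plain text of every sentence element under `element`."""
    return ["".join(s.itertext()) for s in element.iter("sentence")]


sentence_texts = sentences(ET.fromstring(ABSTRACT))
```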


Parts of the text in sentences and titles that are linguistically meaningful from a biological perspective are marked up as cons (constituent) elements. A cons element carries its semantic description as the value of its sem attribute.


A semantic description can be a direct reference to a concept predefined in one of the known ontologies.

<cons sem="G#other_name">IL-2 gene expression</cons> and <cons sem="G#other_name"><cons sem="G#protein_molecule">NF-kappa B</cons> activation</cons> through ...
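
Since cons elements can be nested, as in the NF-kappa B example, a reader of the corpus has to walk the whole element tree rather than only the direct children of a sentence. The following Python sketch lists every cons element with its text and sem value. The enclosing sentence element and the function name annotated_terms are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The example above, wrapped in a sentence element for parsing.
SENTENCE = (
    '<sentence><cons sem="G#other_name">IL-2 gene expression</cons> and '
    '<cons sem="G#other_name"><cons sem="G#protein_molecule">NF-kappa B</cons>'
    ' activation</cons></sentence>'
)


def annotated_terms(sentence):
    """Return (text, sem) pairs for every cons element in document order."""
    return [
        ("".join(cons.itertext()), cons.get("sem"))
        for cons in sentence.iter("cons")
    ]


terms = annotated_terms(ET.fromstring(SENTENCE))
```

In document order, an outer cons element is reported before the cons elements nested inside it.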


Here, the known ontologies are those that have been imported into the corpus as follows:

<set>
<import resource="GENIAontology.daml" prefix="G"/>
...
</set>
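
One reading of this declaration, assumed here rather than stated in this reference, is that the part of a sem value before the # sign is the prefix of an imported ontology and the part after it is the concept name. Under that assumption, a sem value could be resolved as in the Python sketch below. The function names prefix_map and resolve are hypothetical.

```python
import xml.etree.ElementTree as ET

# Minimal set element containing only the import declaration.
SAMPLE = """<set>
<import resource="GENIAontology.daml" prefix="G"/>
</set>"""


def prefix_map(root):
    """Map each declared prefix to the ontology resource it stands for."""
    return {imp.get("prefix"): imp.get("resource") for imp in root.iter("import")}


def resolve(sem, prefixes):
    """Split 'prefix#name' and return (resource, name).

    Assumes the text before '#' is a declared prefix; raises KeyError otherwise.
    """
    prefix, _, name = sem.partition("#")
    return prefixes[prefix], name


prefixes = prefix_map(ET.fromstring(SAMPLE))
resolved = resolve("G#protein_molecule", prefixes)
```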


Alternatively, it can be a predicate-argument structure enclosed in parentheses, which expresses a complex concept.

<cons sem="(AND G#protein_molecule G#protein_molecule)"><cons>CD2</cons> and <cons>CD25</cons> <cons>receptors</cons></cons>
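
A sem value of this kind can be turned into a nested list with a small recursive parser. The Python sketch below assumes that tokens are separated by whitespace and that structures may be nested. Only the AND form shown above is taken from this reference, and the function name parse_sem is hypothetical.

```python
import re


def parse_sem(sem):
    """Parse a sem value into a string (plain concept) or a nested list."""
    # Tokens are "(", ")" or any run of characters without spaces or parentheses.
    tokens = re.findall(r"\(|\)|[^\s()]+", sem)
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of sem value")
        token = tokens[pos]
        pos += 1
        if token == ")":
            raise ValueError("unexpected ')'")
        if token == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(parse())
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # skip the closing parenthesis
            return items
        return token

    result = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens in sem value")
    return result


structure = parse_sem("(AND G#protein_molecule G#protein_molecule)")
simple = parse_sem("G#other_name")
```

In the resulting list, the first item is the predicate and the remaining items are its arguments.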


Some cons elements have no sem attribute, because the text they mark cannot be identified with any entry in the known ontologies. In the example below, the inner cons elements around CD2, CD25 and receptors have no sem attribute.

<cons sem="(AND G#protein_molecule G#protein_molecule)"><cons>CD2</cons> and <cons>CD25</cons> <cons>receptors</cons></cons>
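
Code that reads the corpus therefore has to allow for a missing sem attribute. The following Python sketch collects the text of the cons elements without one. The function name unlabelled is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# The example above, used as the root element.
EXAMPLE = (
    '<cons sem="(AND G#protein_molecule G#protein_molecule)">'
    '<cons>CD2</cons> and <cons>CD25</cons> <cons>receptors</cons></cons>'
)


def unlabelled(element):
    """Return the text of every cons element that has no sem attribute."""
    return [
        "".join(cons.itertext())
        for cons in element.iter("cons")  # includes `element` itself
        if cons.get("sem") is None
    ]


missing = unlabelled(ET.fromstring(EXAMPLE))
```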


written by Jin-Dong Kim (mail@jdkim.net), April 3, 2003