The top-level element of the GENIA corpus is a set element containing articles.
<?xml version="1.0"?>
<!DOCTYPE set
SYSTEM "gpml.dtd">
<set>
<article>
...
</article>
<article>
...
</article>
...
</set>
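As a minimal sketch of reading this top-level structure, the following Python uses the standard xml.etree.ElementTree module. The sample string and the helper name load_articles are illustrative assumptions, not part of the corpus distribution; the parser used here does not fetch or validate against gpml.dtd.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus that follows the layout shown above.
SAMPLE_CORPUS = """<?xml version="1.0"?>
<!DOCTYPE set SYSTEM "gpml.dtd">
<set>
<article><title>First</title></article>
<article><title>Second</title></article>
</set>"""

def load_articles(xml_text):
    """Parse corpus text and return the <article> children of the <set> root."""
    root = ET.fromstring(xml_text)
    if root.tag != "set":
        raise ValueError("expected <set> as the top-level element, got <%s>" % root.tag)
    return root.findall("article")

articles = load_articles(SAMPLE_CORPUS)
print(len(articles))  # 2
```

For a corpus stored on disk, ET.parse(path).getroot() could replace ET.fromstring under the same assumptions.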
Each article in the GENIA corpus has its MEDLINE ID, title, and abstract, in that order.
<article>
<articleinfo>
<bibliomisc>MEDLINE:82055447</bibliomisc>
</articleinfo>
<title>...</title>
<abstract>
...
</abstract>
</article>
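The three parts of an article could be pulled out along the following lines. This is a sketch only: the sample article is invented for illustration, and read_article is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical article following the structure described above.
SAMPLE_ARTICLE = """<article>
<articleinfo>
<bibliomisc>MEDLINE:82055447</bibliomisc>
</articleinfo>
<title><sentence>Sample title.</sentence></title>
<abstract>
<sentence>First sentence.</sentence>
<sentence>Second sentence.</sentence>
</abstract>
</article>"""

def read_article(article):
    """Return (id_text, title_text, abstract_text) for one <article> element."""
    id_text = article.findtext("articleinfo/bibliomisc", default="").strip()
    title = article.find("title")
    abstract = article.find("abstract")
    # itertext() flattens any nested markup; split/join normalizes whitespace.
    title_text = " ".join("".join(title.itertext()).split()) if title is not None else ""
    abstract_text = " ".join("".join(abstract.itertext()).split()) if abstract is not None else ""
    return id_text, title_text, abstract_text

id_text, title_text, abstract_text = read_article(ET.fromstring(SAMPLE_ARTICLE))
print(id_text)
print(title_text)
print(abstract_text)
```

The identifier is returned exactly as it appears in bibliomisc, prefix included, so that no assumption is made about its internal format.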
All text in the titles and abstracts is segmented into sentences.
<abstract>
<sentence>We have developed a whole cell binding assay with [3H]dexamethasone as the ligand for the measurement of the glucocorticoid receptor (GR) content of normal and malignant human leukocytes.</sentence>
...
<sentence>No obvious correlation was found between the GR content and the phenotype of the cell line nor between the GR content and the in vitro growth inhibition by glucocorticoids.</sentence>
</abstract>
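Because sentence boundaries are explicit in the markup, no sentence splitter is needed when reading the corpus. A minimal sketch, assuming an invented two-sentence abstract and a hypothetical helper named sentence_texts:

```python
import xml.etree.ElementTree as ET

# Hypothetical abstract with two pre-segmented sentences.
SAMPLE_ABSTRACT = """<abstract>
<sentence>First sentence.</sentence>
<sentence>Second sentence.</sentence>
</abstract>"""

def sentence_texts(element):
    """Return the plain text of every <sentence> below element, in document order."""
    return ["".join(s.itertext()) for s in element.iter("sentence")]

sentences = sentence_texts(ET.fromstring(SAMPLE_ABSTRACT))
print(sentences)  # ['First sentence.', 'Second sentence.']
```

The same function works on a title element, since titles are segmented in the same way.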
Parts of the text in sentences and titles that are linguistically meaningful from a biological perspective are marked up as cons (short for constituent) elements, whose sem attribute holds a semantic description.
A semantic description can be a direct reference to a concept predefined in one of the known ontologies.
<cons sem="G#other_name">IL-2 gene expression</cons> and <cons sem="G#other_name"><cons sem="G#protein_molecule">NF-kappa B</cons> activation</cons> through ...
Here, the known ontologies are those that have been imported into the corpus in the following way:
<set>
<import resource="GENIAontology.daml" prefix="G"/>
...
</set>
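The relation between the import declaration and the prefix#concept values of sem can be sketched as follows. The sample set is assembled from the fragments above purely for illustration, with import written as an empty element so that the sample parses as well-formed XML; ontology_prefixes and direct_annotations are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical set combining the import declaration and the annotated fragment above.
SAMPLE_SET = """<set>
<import resource="GENIAontology.daml" prefix="G"/>
<article>
<title><sentence><cons sem="G#other_name">IL-2 gene expression</cons> and <cons sem="G#other_name"><cons sem="G#protein_molecule">NF-kappa B</cons> activation</cons></sentence></title>
</article>
</set>"""

def ontology_prefixes(root):
    """Map each declared prefix to the ontology resource it stands for."""
    return {imp.get("prefix"): imp.get("resource") for imp in root.iter("import")}

def direct_annotations(root):
    """Return (prefix, concept, text) for each <cons> whose sem is a plain prefix#concept."""
    found = []
    for cons in root.iter("cons"):
        sem = cons.get("sem")
        if sem is None or sem.startswith("("):
            continue  # no sem, or a parenthesized structure: not a direct reference
        prefix, _, concept = sem.partition("#")
        found.append((prefix, concept, "".join(cons.itertext())))
    return found

root = ET.fromstring(SAMPLE_SET)
prefixes = ontology_prefixes(root)
annotations = direct_annotations(root)
for prefix, concept, text in annotations:
    print(prefixes[prefix], concept, text)
```

Elements are visited in document order, so a nested cons is reported after the cons that encloses it.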
Alternatively, a semantic description can be a predicate-argument structure enclosed in parentheses, which expresses a complex concept.
<cons sem="(AND G#protein_molecule G#protein_molecule)"><cons>CD2</cons> and <cons>CD25</cons> <cons>receptors</cons></cons>
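A parenthesized sem value of this kind can be read with a small recursive parser. The sketch below assumes that tokens are separated by whitespace and that structures may nest; only the AND predicate appears in this description, so handling of any other predicate is an assumption. The function name parse_sem is hypothetical.

```python
def parse_sem(sem):
    """Parse a sem value.

    A direct reference such as 'G#other_name' is returned as a string.
    A structure such as '(AND a b)' is returned as ('AND', ['a', 'b']).
    """
    tokens = sem.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of sem value: %r" % sem)
        token = tokens[pos]
        pos += 1
        if token == ")":
            raise ValueError("unexpected ')' in sem value: %r" % sem)
        if token != "(":
            return token  # a plain concept reference
        if pos >= len(tokens):
            raise ValueError("missing predicate in sem value: %r" % sem)
        predicate = tokens[pos]
        pos += 1
        args = []
        while pos < len(tokens) and tokens[pos] != ")":
            args.append(parse())
        if pos >= len(tokens):
            raise ValueError("missing ')' in sem value: %r" % sem)
        pos += 1  # consume the closing parenthesis
        return (predicate, args)

    result = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens in sem value: %r" % sem)
    return result

parsed = parse_sem("(AND G#protein_molecule G#protein_molecule)")
print(parsed)  # ('AND', ['G#protein_molecule', 'G#protein_molecule'])
```

Returning a plain string for direct references and a tuple for structures lets a caller distinguish the two forms of semantic description with a single isinstance check.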
Some cons elements have no sem attribute, because the text they mark cannot be identified with any entry in the known ontologies; the inner cons elements in the following example are such cases.
<cons sem="(AND G#protein_molecule G#protein_molecule)"><cons>CD2</cons> and <cons>CD25</cons> <cons>receptors</cons></cons>
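When processing the corpus, cons elements with and without a sem attribute usually need to be told apart. A minimal sketch, wrapping the fragment above in a sentence element for parsing; split_by_sem is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The fragment above, wrapped in <sentence> so that it has a single root element.
SAMPLE_SENTENCE = (
    '<sentence><cons sem="(AND G#protein_molecule G#protein_molecule)">'
    "<cons>CD2</cons> and <cons>CD25</cons> <cons>receptors</cons>"
    "</cons></sentence>"
)

def split_by_sem(root):
    """Return (with_sem, without_sem): the texts of <cons> elements grouped by
    whether they carry a sem attribute."""
    with_sem, without_sem = [], []
    for cons in root.iter("cons"):
        text = "".join(cons.itertext())
        if "sem" in cons.attrib:
            with_sem.append(text)
        else:
            without_sem.append(text)
    return with_sem, without_sem

with_sem, without_sem = split_by_sem(ET.fromstring(SAMPLE_SENTENCE))
print(with_sem)     # ['CD2 and CD25 receptors']
print(without_sem)  # ['CD2', 'CD25', 'receptors']
```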
written by Jin-Dong Kim (mail@jdkim.net), April 3, 2003