GENIA corpus - Linguistic and Semantic Annotation of Biomedical Literature


Jin-Dong Kim
(GENIA, University of Tokyo)

The GENIA corpus is a collection of abstracts of journal articles on molecular biology. The corpus has been annotated for a wide spectrum of information expressed in the text, from two perspectives. First, the biological knowledge conveyed by the text has been annotated, covering biological entities and events. Second, the linguistic structures underlying the text have been annotated, including the part of speech of each word and the syntactic structure of each sentence. Combining the two perspectives is expected to reveal which linguistic structures encode particular pieces of knowledge. This presentation introduces the GENIA corpus with a primary focus on its semantic annotation.