Seminar — Jian Su

Speaker: Dr Jian Su, Institute for Infocomm Research (I2R), Singapore
Title: 1) Coreference Resolution in Biology Literature: a Machine Learning Approach
2) An effective method of using Web based information for Relation Extraction
Date: 23rd April 2008 at 12:00
Location: Room MLG.001 (Lecture Theatre) in the MIB Building

1) Coreference Resolution

Coreference resolution, the process of identifying different mentions of the same entity, is a key technology in a text-mining system. Only with it can text-mining systems, such as a protein-protein interaction extraction system, capture and link information expressed with nominal mentions (e.g. "this protein") and pronoun mentions (e.g. "it") in addition to name mentions (e.g. "P50"). Compared with work on news articles, existing studies of coreference resolution in biomedical texts are quite preliminary: they focus only on specific types of anaphors such as pronouns or definite noun phrases, use heuristic methods, and run on small data sets. There is therefore a need for an in-depth exploration of this task in the biomedical domain.

In this talk, I'll present a learning-based approach to coreference resolution in the biomedical domain. In this study, we annotated a large-scale coreference corpus, MedCo, which consists of 1,999 MEDLINE abstracts from the GENIA data set. We further proposed a detailed framework for the coreference resolution task, in which we augmented the traditional learning model by incorporating non-anaphors into training. We also explored various sources of knowledge for coreference resolution, particularly those that can deal with the complexity of biomedical texts.

The evaluation on our corpus showed promising results. We achieved a high precision of 86.2% with a reasonable recall of 63.9%, obtaining an F-measure of 73.4%. The results also suggested that our augmented learning model significantly boosts precision (up to 23.7%) without much loss in recall (less than 5%), which brings a gain of 8% in F-measure.
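As a quick check of how the reported figures relate, the F-measure is the harmonic mean of precision and recall. The short sketch below recomputes it from the two numbers quoted above; the function name `f_measure` and the `beta` parameter are illustrative choices, not part of the original work.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract.
f1 = f_measure(0.862, 0.639)
print(f"F-measure: {f1:.1%}")  # prints "F-measure: 73.4%"
```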

2) Relation Extraction

In this talk, I'll present our method for incorporating paraphrase information from the Web to boost the performance of a supervised relation extraction system. Contextual information is extracted from the Web using a semi-supervised process, and summarized by skip-bigram overlap measures over the entire extract. This allows the capture of local contextual information as well as more distant associations. A statistically significant boost in relation extraction performance is observed.
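To illustrate the idea of skip-bigram overlap, here is a minimal sketch. A skip-bigram is any ordered pair of tokens from a text, allowing gaps between them, which is why it can reflect both adjacent words and more distant associations. The abstract does not specify the exact overlap measure used, so the Dice coefficient below, the function names, and the `max_skip` parameter are all assumptions made for illustration only.

```python
from itertools import combinations


def skip_bigrams(tokens, max_skip=None):
    """Return the set of ordered token pairs, with at most max_skip tokens between them.

    max_skip=None allows any gap; max_skip=0 gives ordinary bigrams.
    """
    pairs = set()
    for i, j in combinations(range(len(tokens)), 2):
        if max_skip is None or j - i - 1 <= max_skip:
            pairs.add((tokens[i], tokens[j]))
    return pairs


def skip_bigram_overlap(a, b, max_skip=None):
    """Dice coefficient over the skip-bigram sets of two token lists (an assumed measure)."""
    sa, sb = skip_bigrams(a, max_skip), skip_bigrams(b, max_skip)
    if not sa or not sb:
        return 0.0
    return 2.0 * len(sa & sb) / (len(sa) + len(sb))


# Hypothetical example contexts, not taken from the ACE corpus.
x = "the company acquired the startup".split()
y = "the company bought the startup".split()
print(round(skip_bigram_overlap(x, y), 3))
```

Limiting `max_skip` trades coverage of distant associations for less noise from unrelated word pairs.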

Two extensions, thematic clustering and hypernym expansion, are investigated. In tandem with thematic clustering to reduce noise in the paraphrase extraction, we attempt to increase the coverage of our search for paraphrases using hypernym expansion. Evaluation of our method on the ACE 2004 corpus shows that it outperforms the baseline SVM-based supervised learning algorithm across almost all major ACE relation types, by a margin of up to 31%. This approach could also be extended to relation extraction in biology literature, such as protein-protein interaction extraction.