Dictionary-based Approaches for Biomedical Term Recognition

Yoshimasa Tsuruoka

Dictionary-based technical term recognition is the first step toward practical information extraction from biomedical documents because, unlike machine-learning-based approaches, it provides ID information for the recognized terms. However, dictionary-based approaches have two serious problems: (1) a large number of false recognitions, mainly caused by short names, and (2) low recall due to spelling variation. In this talk, we address the former problem by filtering out false positives using a machine learning technique, and we alleviate the latter by using an approximate string searching method. This talk also presents an algorithm that generates possible variants of biomedical terms, which is potentially useful for query and dictionary expansion. Experimental results using the MEDLINE corpus indicate that our methods will significantly improve the precision and recall of dictionary-based term recognition.
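
To illustrate how approximate string searching can recover spelling variants and still return a dictionary ID, a minimal sketch follows. It is not the method presented in the talk: the abstract does not specify the algorithm, so this assumes plain Levenshtein edit distance as the similarity measure. The dictionary entries, the IDs, and the names `normalize`, `lookup`, and `max_dist` are all hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(term: str) -> str:
    """Assumed normalization: lowercase and treat hyphens as spaces."""
    return term.lower().replace("-", " ")


# Toy dictionary mapping normalized names to made-up IDs.
TOY_DICT: Dict[str, str] = {
    "nf kappa b": "ID:0001",
    "interleukin 2": "ID:0002",
}


def lookup(candidate: str, dictionary: Dict[str, str],
           max_dist: int = 1) -> List[Tuple[str, str, int]]:
    """Return (entry, ID, distance) for entries within max_dist edits."""
    query = normalize(candidate)
    hits = []
    for entry, term_id in dictionary.items():
        dist = edit_distance(query, entry)
        if dist <= max_dist:
            hits.append((entry, term_id, dist))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    print(lookup("NF-kappaB", TOY_DICT))      # variant of "NF-kappa B"
    print(lookup("Interleukin-2", TOY_DICT))  # exact match after normalization
    print(lookup("IL-2", TOY_DICT))           # abbreviation, not a spelling variant
```

The linear scan over the dictionary is only suitable for a toy example. The last query also shows why short names are troublesome: with a loose threshold, a two- or three-letter name lies within a few edits of many unrelated strings, which is the kind of false positive the filtering step is meant to remove.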