NaCTeM

Part-of-Speech Tagger for Biomedical Text

TODO

convert the UPenn biocorpus data into the "pos" format (done)
- Spaces within a token were replaced with underscores.
- training data: the first 90% of each subcorpus (CYP450 and oncology)
- test data: the remaining 10% of each subcorpus
train the tagger (done)
- use PTB, GENIA, and the UPenn biocorpus above as the training data.
evaluate the tagger (done)
- tagging accuracy: 96.94% on PTB, 98.26% on GENIA, 97.78% on UpennBio.
- compare with the GENIA tagger and the TnT tagger:
  - three articles from the latest issue of Nucleic Acids Research (March 2005)
  - tagging results
build the server
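
The conversion, split, and evaluation steps above can be sketched roughly as follows. This is only an illustration, not the actual scripts used here: the "token/TAG" line layout is an assumption about the "pos" format, and the function names (to_pos_line, split_train_test, tagging_accuracy) are hypothetical.

```python
def to_pos_line(tagged_tokens):
    """Join (token, tag) pairs into one line.

    Spaces inside a token are replaced with underscores. The "token/TAG"
    layout is an assumed rendering of the "pos" format.
    """
    return " ".join(
        f"{token.replace(' ', '_')}/{tag}" for token, tag in tagged_tokens
    )


def split_train_test(sentences, train_percent=90):
    """Use the first train_percent% of the sentences for training, the rest for testing."""
    cut = len(sentences) * train_percent // 100
    return sentences[:cut], sentences[cut:]


def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return correct / len(gold_tags)


if __name__ == "__main__":
    print(to_pos_line([("cytochrome P450", "NN"), ("is", "VBZ")]))
    train, test = split_train_test(list(range(10)))
    print(len(train), len(test))
```

The split is applied to each subcorpus separately, so that both CYP450 and oncology text appear in the training and the test data.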

The acronym generator takes a term (typically a multi-word term) and generates possible acronyms for it.
e.g. human growth hormone -> HGH
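
To make the input and output of the task concrete, here is a deliberately naive baseline that builds candidates from word initials. It is not the MEMM-based generator listed below; the function name naive_acronyms and the small function-word list are assumptions made for this illustration.

```python
# Hypothetical list of function words that acronyms often skip.
FUNCTION_WORDS = {"of", "the", "and", "for", "in", "to", "a", "an"}


def naive_acronyms(term):
    """Return candidate acronyms for a term, built from word initials only.

    Candidate 1 uses the initial of every word; candidate 2 skips function
    words. Duplicates are dropped while the order is preserved.
    """
    words = term.split()
    all_initials = "".join(w[0].upper() for w in words)
    content_initials = "".join(
        w[0].upper() for w in words if w.lower() not in FUNCTION_WORDS
    )
    candidates = []
    for cand in (all_initials, content_initials):
        if cand and cand not in candidates:
            candidates.append(cand)
    return candidates


if __name__ == "__main__":
    print(naive_acronyms("polymerase chain reaction"))
    print(naive_acronyms("activator of transcription"))
```

A baseline like this misses acronyms that take several letters from one word or that reorder letters, which is one reason to learn the letter choices from annotated data instead.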

TODO

collect papers on acronym detection (done)
make the training data (done)
- extract definition-acronym pairs from medline abstracts
- annotate each pair with positional information
- 1,901 pairs
build the generator (done)
- MEMM-based algorithm
evaluate (done)
- 5-fold cross-validation
- Coverage: 55.2% (Top 1), 75.4% (Top 5), 82.2% (Top 10)
write a paper (done): "A Machine Learning Approach to Acronym Generation" (pdf)
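
The Top-k coverage figures above can be read as the fraction of test terms whose correct acronym appears among the k highest-ranked candidates. A minimal sketch of that computation, assuming this reading of "coverage" and using a hypothetical function name and toy data (not the real evaluation set):

```python
def top_k_coverage(ranked_candidates, gold_acronyms, k):
    """Fraction of terms whose gold acronym is among the top-k candidates.

    ranked_candidates: one list of candidates per term, best first.
    gold_acronyms: the correct acronym for each term, in the same order.
    """
    if len(ranked_candidates) != len(gold_acronyms):
        raise ValueError("need one candidate list per gold acronym")
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_acronyms)
        if gold in candidates[:k]
    )
    return hits / len(gold_acronyms)


if __name__ == "__main__":
    # Toy data for illustration only.
    ranked = [["PCR", "PCHR"], ["AOT", "AT"], ["XY", "XZ"], ["MRI"]]
    gold = ["PCR", "AT", "XQ", "MRI"]
    print(top_k_coverage(ranked, gold, 1))
    print(top_k_coverage(ranked, gold, 5))
```

Under 5-fold cross-validation, this number would be computed on each held-out fold and then averaged.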

TODO

This page is maintained by TSURUOKA Yoshimasa