Part-of-Speech Tagger for Biomedical Text
TODO
- convert the upenn biocorpus data into the "pos" format. (done)
  - spaces within a token were replaced with underscores.
  - training data: the first 90% of each part (cyp450 and oncology)
  - test data: the remaining 10% of each part
- train the tagger (done)
  - use PTB, GENIA and the above upenn biocorpus as the training data.
- evaluate the tagger (done)
  - tagging accuracy: 96.94% on PTB, 98.26% on GENIA, 97.78% on the upenn biocorpus.
- compare with the GENIA tagger and the TnT tagger:
- build the server
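
The data preparation and evaluation steps listed above could be sketched roughly as follows. This is only an illustration: the layout of the "pos" format is not described on this page, so the word/TAG layout used here is an assumption, and all function names are hypothetical.

```python
import re


def normalize_token(token):
    """Replace whitespace inside a token with underscores."""
    return re.sub(r"\s+", "_", token.strip())


def to_pos_line(tagged_sentence):
    """Render one sentence as space-separated word/TAG pairs.

    The word/TAG layout is an assumption, not the documented "pos" format.
    """
    return " ".join(f"{normalize_token(w)}/{t}" for w, t in tagged_sentence)


def split_train_test(sentences, train_percent=90):
    """Keep the first train_percent of the sentences for training, the rest for testing."""
    cut = len(sentences) * train_percent // 100
    return sentences[:cut], sentences[cut:]


def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


if __name__ == "__main__":
    sentence = [("cytochrome P450", "NN"), ("is", "VBZ"), ("inhibited", "VBN")]
    print(to_pos_line(sentence))
    train, test = split_train_test(list(range(10)))
    print(len(train), len(test))
```

Since the split is taken per part, split_train_test would be called once for cyp450 and once for oncology, and the two training portions concatenated.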
Acronym Generator
The acronym generator takes a term (typically a multi-word term)
and generates possible acronyms for it.
ex.) human growth hormone -> HGH
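
To make the input and output concrete, here is a naive candidate enumerator. It is not the machine learning method used by the generator; it simply combines the first one or two letters of each word, and the function name and the max_letters_per_word parameter are hypothetical.

```python
from itertools import product


def candidate_acronyms(term, max_letters_per_word=2):
    """Enumerate acronym candidates from the leading letters of each word.

    Naive baseline for illustration only: every word contributes its first
    1..max_letters_per_word letters, and all combinations are listed.
    """
    words = term.split()
    if not words:
        return []
    per_word = [
        [word[:n].upper() for n in range(1, min(max_letters_per_word, len(word)) + 1)]
        for word in words
    ]
    seen = set()
    candidates = []
    for parts in product(*per_word):
        candidate = "".join(parts)
        if candidate not in seen:  # keep first occurrence, preserve order
            seen.add(candidate)
            candidates.append(candidate)
    return candidates


if __name__ == "__main__":
    print(candidate_acronyms("human growth hormone"))
```

A real generator would also need to rank the candidates, which is the step that a learned model handles.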
TODO
- collect papers on acronym detection (done)
- make the training data (done)
  - extract definition-acronym pairs from MEDLINE abstracts
  - annotate each pair with positional information
  - 1,901 pairs
- build the generator (done)
- evaluate (done)
  - 5-fold cross-validation
  - Coverage: 55.2% (Top 1), 75.4% (Top 5), 82.2% (Top 10)
- write a paper (done)
"A Machine Learning Approach to Acronym Generation"(pdf)
Chunker for Biomedical Text
TODO
- convert the GENIA and upenn-biocorpus data into the "chunk" format.
- train the chunker
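
The "chunk" format is not defined on this page. As one possibility, the sketch below assumes a CoNLL-2000-style layout with one token per line ("word POS chunk-tag") and B-/I-/O chunk tags. The function names and the (start, end, label) span representation are hypothetical.

```python
def spans_to_bio(n_tokens, spans):
    """Turn (start, end, label) chunk spans (end exclusive) into B-/I-/O tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def to_chunk_lines(tokens, pos_tags, spans):
    """Render a sentence as 'word POS chunk-tag' lines (assumed layout)."""
    bio = spans_to_bio(len(tokens), spans)
    return [
        f"{word.replace(' ', '_')} {pos} {chunk}"
        for word, pos, chunk in zip(tokens, pos_tags, bio)
    ]


if __name__ == "__main__":
    tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa B"]
    pos_tags = ["NN", "NN", "NN", "VBZ", "NN"]
    spans = [(0, 3, "NP"), (3, 4, "VP"), (4, 5, "NP")]
    print("\n".join(to_chunk_lines(tokens, pos_tags, spans)))
```

Spaces inside a token are replaced with underscores here so that each line still splits cleanly into three fields.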
This page is maintained by TSURUOKA Yoshimasa