GENIA Sentence Splitter (GeniaSS) [1] is a sentence splitter optimized for biomedical texts. GeniaSS reads a text and splits it into sentences by inserting line breaks.
First, GeniaSS detects candidate split positions using a set of delimiters: periods, commas, single and double quotation marks, right parentheses, etc. Then, it classifies each candidate as either a true sentence boundary or not.
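The two-stage idea described above can be sketched roughly as follows. This is an illustration only, not the GeniaSS implementation: the delimiter set covers just the characters named above (the real list ends with "etc."), the rule that a delimiter must be followed by whitespace is an assumption, and find_candidates, split_sentences and toy_classifier are hypothetical names. In particular, toy_classifier is a hand-written stand-in for the trained classifier.

```python
import re

# Delimiters named in the text above. The real list is longer ("etc."),
# so this set is illustrative only.
DELIMITERS = ".,'\")"

# Assumption: a candidate is any delimiter immediately followed by whitespace.
CANDIDATE = re.compile("[" + re.escape(DELIMITERS) + r"](?=\s)")


def find_candidates(text):
    """Stage 1: return the offset just after each candidate delimiter."""
    return [m.end() for m in CANDIDATE.finditer(text)]


def split_sentences(text, is_boundary):
    """Stage 2: insert a line break at every candidate the classifier accepts."""
    pieces, start = [], 0
    for pos in find_candidates(text):
        if is_boundary(text, pos):
            pieces.append(text[start:pos].strip())
            start = pos
    tail = text[start:].strip()
    if tail:
        pieces.append(tail)
    return "\n".join(pieces)


def toy_classifier(text, pos):
    """Hypothetical stand-in for the trained classifier: accept a period
    that is followed by whitespace and then an uppercase letter."""
    rest = text[pos:].lstrip()
    return text[pos - 1] == "." and rest[:1].isupper()


if __name__ == "__main__":
    sample = "IL-2 binds its receptor, as shown (Fig. 1). The effect was strong."
    print(split_sentences(sample, toy_classifier))
```

In this sketch the comma and the period in "Fig." are both detected as candidates but rejected by the second stage, which is the reason for separating detection from classification.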
1) make
2) ./geniass arg1 arg2
arg1 is the target file to split. arg2 is the output file name. To get a stand-off format file, run
3) ruby sentence2standOff.rb arg1 arg2 arg3
arg1 and arg2 are the same as in 2). arg3 is the output stand-off file name.
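For readers unfamiliar with stand-off annotation, the general idea is to record where each sentence sits in the original text as character offsets, instead of modifying the text itself. The sketch below illustrates that idea only. The exact format written by sentence2standOff.rb is not described here, so the (start, end) tuples and the function name sentence_offsets are assumptions made for illustration.

```python
def sentence_offsets(original, sentences):
    """Locate each sentence in the original text, in order, and return
    a list of (start, end) character offsets (end is exclusive)."""
    spans, cursor = [], 0
    for sentence in sentences:
        # Search from the end of the previous sentence so that repeated
        # sentences map to successive positions.
        start = original.find(sentence, cursor)
        if start < 0:
            raise ValueError(f"sentence not found in original text: {sentence!r}")
        end = start + len(sentence)
        spans.append((start, end))
        cursor = end
    return spans


if __name__ == "__main__":
    original = "A binds B. C inhibits D."
    for start, end in sentence_offsets(original, ["A binds B.", "C inhibits D."]):
        print(start, end, original[start:end])
```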
Note: GeniaSS must be run in the directory that contains EventExtracter.rb, Classifying2Splitting.rb, and model1-1.0.
[1] Sætre, Rune, Kazuhiro Yoshida, Akane Yakushiji, Yusuke Miyao, Yuichiro Matsubayashi and Tomoko Ohta, AKANE System: Protein-Protein Interaction Pairs in BioCreAtIvE2 Challenge, PPI-IPS subtask. In Proceedings of the Second BioCreative Challenge Evaluation Workshop. pp. 209--212, April 2007. CNIO.
[2] Yoshimasa Tsuruoka, A simple C++ library for maximum entropy classification, http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/ , 2005.
[3] Kim J.D., Ohta T., Tateishi Y., and Tsujii J., GENIA corpus - a semantically annotated corpus for bio-textmining, Bioinformatics, 19(suppl. 1):i180--i182, 2003.
Created and maintained by Yuichiroh Matsubayashi