GENIA Sentence Splitter

GENIA Sentence Splitter (GeniaSS) [1] is a sentence splitter optimized for biomedical texts. GeniaSS reads a text and splits it into sentences by inserting line breaks.


Classification model

First, GeniaSS detects candidate split positions using a set of delimiters: periods, commas, single and double quotation marks, right parentheses, etc. Then it classifies each candidate as either a true sentence boundary or not.
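The two-stage process described above can be illustrated with a minimal Ruby sketch. This is not GeniaSS's actual code: the names DELIMITERS, candidate_positions, boundary? and split_sentences are hypothetical, the delimiter list covers only the characters named above (the real list is longer), and boundary? is a simple stand-in rule rather than the trained classifier.

```ruby
# Hypothetical sketch of "detect candidates, then classify"; not GeniaSS's code.
# Only the delimiters named in the text are listed; the real set is larger.
DELIMITERS = ['.', ',', "'", '"', ')'].freeze

# Stage 1: return the character offset of every delimiter in +text+.
def candidate_positions(text)
  text.each_char.with_index.select { |ch, _| DELIMITERS.include?(ch) }.map { |_, i| i }
end

# Stage 2 (assumed stand-in for the real classifier): accept a candidate only
# if it is a period that ends the text or is followed by whitespace and an
# uppercase letter.
def boundary?(text, pos)
  return false unless text[pos] == '.'
  rest = text[(pos + 1)..-1]
  rest.empty? || rest.match?(/\A\s+[A-Z]/)
end

# Cut the text after every candidate that the classifier accepts.
def split_sentences(text)
  sentences = []
  start = 0
  candidate_positions(text).each do |pos|
    next unless boundary?(text, pos)
    sentences << text[start..pos].strip
    start = pos + 1
  end
  tail = text[start..-1].to_s.strip
  sentences << tail unless tail.empty?
  sentences
end

p split_sentences('See e.g. p50. Done.') # prints ["See e.g. p50.", "Done."]
```

The periods inside "e.g." are detected as candidates but rejected in the second stage, which is why a separate classification step is needed. A rule this simple would still mis-split at abbreviations followed by a capitalized word.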

Classifier Features


How to use

1) make
2) ./geniass arg1 arg2

arg1 is the file to split, and arg2 is the output file name. To get the output in stand-off format, run

3) ruby sentence2standOff.rb arg1 arg2 arg3

arg1 and arg2 are the same as in 2). arg3 is the name of the output stand-off file.

Note: GeniaSS must be run in the directory that contains EventExtracter.rb, Classifying2Splitting.rb, and model1-1.0.
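The steps and the note above could be wrapped in a small Ruby helper that checks for the required files before running anything. This is only a sketch: the names REQUIRED_FILES, missing_files and run_geniass are hypothetical and not part of the distribution, and it assumes `make` has already been run in that directory.

```ruby
# Hypothetical wrapper; not shipped with GeniaSS.
# Files the note says must be in the directory GeniaSS is run from.
REQUIRED_FILES = %w[EventExtracter.rb Classifying2Splitting.rb model1-1.0].freeze

# Return the required files that are missing from +dir+.
def missing_files(dir)
  REQUIRED_FILES.reject { |name| File.exist?(File.join(dir, name)) }
end

# Run step 2 and, if +standoff+ is given, step 3, from inside +dir+.
# Relative paths are resolved against +dir+. Returns true only if every
# command that was run exited successfully.
def run_geniass(dir, input, output, standoff = nil)
  missing = missing_files(dir)
  unless missing.empty?
    warn "Missing in #{dir}: #{missing.join(', ')}"
    return false
  end
  Dir.chdir(dir) do
    return false unless system('./geniass', input, output)
    return true if standoff.nil?
    system('ruby', 'sentence2standOff.rb', input, output, standoff) == true
  end
end
```

For example, `run_geniass('.', 'abstract.txt', 'abstract.split', 'abstract.so')` would run both steps in the current directory; the three file names here are placeholders.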


[1] Sætre, Rune, Kazuhiro Yoshida, Akane Yakushiji, Yusuke Miyao, Yuichiro Matsubayashi and Tomoko Ohta, AKANE System: Protein-Protein Interaction Pairs in BioCreAtIvE2 Challenge, PPI-IPS subtask. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, pp. 209-212, April 2007. CNIO.

[2] Yoshimasa Tsuruoka, A simple C++ library for maximum entropy classification, 2005.

[3] Kim J.D., Ohta T., Tateishi Y., and Tsujii J., GENIA corpus - a semantically annotated corpus for bio-textmining, Bioinformatics, 19(suppl. 1):i180–i182, 2003.

Created and maintained by Yuichiroh Matsubayashi