A Priority Model for Named Entities
John Wilbur
(National Center for Biotechnology Information,
National Library of Medicine)
We introduce a new approach to named entity classification
which we term a Priority Model. We also describe the construction of a
semantic database called SemCat consisting of a large number of names
relevant to biomedicine together with data on their semantic categories.
We used SemCat as training data to investigate name classification
techniques. We generated a statistical language model and probabilistic
context-free grammars for gene and protein name classification, and
compared the results with the new model. For all three methods, we
used a variable-order Markov model to predict the nature of strings not
represented in the training data. The Priority Model achieves an
F-measure of 0.958-0.960, consistently higher than the statistical
language model and probabilistic context-free grammar.
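
For readers unfamiliar with the metric, the following is a minimal sketch of how an F-measure can be computed from classification counts. It assumes the balanced form (F1, the harmonic mean of precision and recall); the abstract does not state which variant was used. The function name, the `beta` parameter, and the counts in the usage lines are hypothetical and are not taken from the reported experiments.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Return the F-measure from true-positive, false-positive and
    false-negative counts. beta=1.0 gives the balanced F1 score
    (an assumption; the source does not specify the variant)."""
    # Precision: fraction of names labeled positive that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of truly positive names that were labeled positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    print(f_measure(tp=96, fp=4, fn=4))
```

Because the score is a harmonic mean, it is high only when precision and recall are both high, which is why a single figure can summarize classifier quality.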