STePP Tagger

- a Simple Trainable Probabilistic Part-of-speech Tagger -


Overview

The STePP tagger is a general-purpose part-of-speech tagger using log-linear probabilistic models. The main features of this tagger are:

How to Build

The tagger has been tested only on Linux with gcc.

1. Download the latest version of the tagger

2. Expand the archive

% tar xvzf stepp-x.x.tar.gz

3. Make

% cd stepp-x.x/
% make

How to Use

The default package contains a compact model that should work well for ordinary English sentences (a slightly more accurate (+0.05%) but heavier model is available on this page). You can perform part-of-speech tagging with this model using the following command:

% ./stepp -m ./models_wsj02-21c < samples/test.txt > tmp

Note that, by default, the input must contain one sentence per line, and the words must be separated by whitespace.
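
As a rough illustration of this input format, the following shell sketch tidies a text file that already has one sentence per line so that the words are separated by single spaces and no blank lines remain. The helper name normalize_input is hypothetical and not part of the STePP distribution. It does not split punctuation from words; it only cleans up whitespace.

```sh
# normalize_input: hypothetical helper, not part of STePP.
# Reads text on stdin and writes it to stdout with
#   - blank (or whitespace-only) lines dropped,
#   - leading and trailing whitespace removed,
#   - runs of spaces and tabs collapsed to a single space.
# Assumption: each input line already holds exactly one sentence.
normalize_input() {
    # NF is zero on blank lines, so they are skipped.
    # Assigning $1 to itself makes awk rebuild the line with single spaces.
    awk 'NF { $1 = $1; print }'
}

# Possible usage (file names are placeholders):
#   normalize_input < raw.txt > input.txt
#   ./stepp -m ./models_wsj02-21c < input.txt > tmp
```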

If you want the tagger to perform tokenization, use the -t option. In this case, you might find the "--standoff" option useful because it lets you easily map the output tokens back to the original input string.

If necessary, you can get tag probabilities for each word with the -p option:

% ./stepp -p -m ./models_wsj02-21c < samples/test.txt > tmp

The tagger has a fast-tagging mode, which is enabled by the -f option. The tagging accuracy of the fast mode is slightly lower than that of the normal mode (about -0.1% on WSJ), but tagging is significantly faster.

% ./stepp -f -m ./models_wsj02-21c < samples/test.txt > tmp
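
To see how much the fast mode changes the output on your own data, one option is to tag the same file in both modes and count the output lines that differ. The following is a minimal sketch. The helper name count_changed_lines and the output file names are hypothetical, and it assumes that both runs write the same number of output lines.

```sh
# count_changed_lines: hypothetical helper, not part of STePP.
# Prints how many line positions differ between two files.
# Assumption: both files have the same number of lines.
count_changed_lines() {
    awk '
        # While reading the first file, remember every line by its number.
        FILENAME == ARGV[1] { first[FNR] = $0; next }
        # While reading the second file, compare as strings, line by line.
        (first[FNR] "") != ($0 "") { changed++ }
        END { print changed + 0 }
    ' "$1" "$2"
}

# Possible usage (normal.txt and fast.txt are placeholder names):
#   ./stepp -m ./models_wsj02-21c < samples/test.txt > normal.txt
#   ./stepp -f -m ./models_wsj02-21c < samples/test.txt > fast.txt
#   count_changed_lines normal.txt fast.txt
```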

You can display the help message with the -h option.

% ./stepp -h

How to Train the Tagger

You can build a tagging model (a collection of probabilistic models) using your own annotated corpus. Use the "stepp-learn" command:

% ./stepp-learn -m ./models samples/train.pos

Once you have trained the model, you can use it by specifying the directory that contains the generated model files:

% ./stepp -m ./models < samples/test.txt > tmp

How to Evaluate the Tagger

You can evaluate tagging accuracy with the "stepp-eval" command.

% ./stepp-eval samples/test.pos tmp samples/train.pos
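
The training, tagging, and evaluation commands above can be chained in a small wrapper so that a failure in one step stops the rest. This is only a sketch: the function name run_pipeline and the STEPP_DIR variable are hypothetical, and the arguments are passed to each program in the same order as in the commands shown above.

```sh
# run_pipeline: hypothetical wrapper, not shipped with STePP.
# STEPP_DIR is an assumed variable naming the directory that holds the
# stepp, stepp-learn and stepp-eval programs (default: current directory).
run_pipeline() {
    train=$1      # annotated training corpus, e.g. samples/train.pos
    input=$2      # sentences to tag,          e.g. samples/test.txt
    gold=$3       # annotated test corpus,     e.g. samples/test.pos
    model_dir=$4  # directory for the model files
    output=$5     # file that receives the tagged output
    bin=${STEPP_DIR:-.}

    # Stop early if training or tagging fails.
    "$bin/stepp-learn" -m "$model_dir" "$train"         || return 1
    "$bin/stepp" -m "$model_dir" < "$input" > "$output" || return 1
    # Same argument order as the stepp-eval command above.
    "$bin/stepp-eval" "$gold" "$output" "$train"
}

# Possible usage, run from the directory where the tagger was built:
#   run_pipeline samples/train.pos samples/test.txt samples/test.pos ./models tmp
```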

Tips

Developers

Yoshimasa Tsuruoka (University of Manchester)

Daisuke Okanohara (University of Tokyo)


This page is maintained by Yoshimasa Tsuruoka