STePP Tagger

- a Simple Trainable Probabilistic Part-of-speech Tagger -

Overview

The STePP tagger is a general-purpose part-of-speech tagger using log-linear probabilistic models. The main features of this tagger are:

- state-of-the-art accuracy (97.3% on the WSJ corpus)
- fast tagging (<1ms/sentence in the "fast" mode)
- can output tag probabilities for each token
- can output n-best tagging sequences
- trainable using your own tagged corpus
- can build compact models using feature selection (L1-regularization)

How to Build

The tagger has been tested only on Linux with gcc.

1. Download the latest version of the tagger

2. Expand the archive


    % tar xvzf stepp-x.x.tar.gz

3. Make


    % cd stepp-x.x/
    % make

How to Use

The default package contains a compact model that should work well for ordinary English sentences. (A slightly more accurate (+0.05%) but heavier model is available on this page.) You can perform part-of-speech tagging with this model using the following command:


    % ./stepp -m ./models_wsj02-21c < samples/test.txt > tmp

Note that, by default, the input must contain one sentence per line, and the words must already be tokenized, i.e. separated by whitespace.
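As a rough illustration of this input format, the sketch below writes a small file with one sentence per line and space-separated tokens, checks the token count on each line, and then tags the file if the tagger has been built. The file names my_input.txt and my_output.txt are hypothetical, and splitting the sentence-final period into its own token is an assumption based on the usual Penn Treebank convention for WSJ-trained models.

```shell
# Two pre-tokenized sentences, one per line; tokens are separated by single spaces.
# (my_input.txt is a hypothetical file name.)
printf '%s\n' 'The cat sat on the mat .' 'Dogs bark loudly .' > my_input.txt

# Sanity check: report the number of whitespace-separated tokens on each line.
awk '{ print NR ": " NF " tokens" }' my_input.txt

# Tag the file only if the tagger has been built in the current directory.
if [ -x ./stepp ]; then
    ./stepp -m ./models_wsj02-21c < my_input.txt > my_output.txt
fi
```

The awk check is optional; it is only a quick way to spot lines that are empty or were not tokenized as intended.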

If you want the tagger to perform tokenization itself, use the -t option. In this case, you might find the "--standoff" option useful, because it lets you easily map the output tokens back to the original input string.
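A minimal sketch of tagging raw text with both options combined is shown below. The file names raw.txt and tagged_standoff.txt are hypothetical. Whether -t also splits sentences is not stated here, so the example keeps one sentence per line; the exact layout of the stand-off output is likewise not described in this document, so inspect the result (or the -h output) before relying on it.

```shell
# One line of raw, untokenized text (raw.txt is a hypothetical file name).
printf '%s\n' 'The price (about $5) rose sharply, analysts said.' > raw.txt

# Let the tagger tokenize the input (-t) and produce stand-off output (--standoff).
# Runs only if the tagger has been built in the current directory.
if [ -x ./stepp ]; then
    ./stepp -t --standoff -m ./models_wsj02-21c < raw.txt > tagged_standoff.txt
fi
```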

If necessary, you can output tag probabilities for each word with the -p option:


    % ./stepp -p -m ./models_wsj02-21c < samples/test.txt > tmp

The tagger also has a fast tagging mode, which is enabled with the -f option. The accuracy of the fast mode is slightly lower than that of the normal mode (about 0.1% lower on WSJ), but tagging is significantly faster.


    % ./stepp -f -m ./models_wsj02-21c < samples/test.txt > tmp

You can display the help message with the -h option.


    % ./stepp -h

How to Train the Tagger

You can build a tagging model (a collection of probabilistic models) using your own annotated corpus. Use the "stepp-learn" command:


    % ./stepp-learn -m ./models samples/train.pos

Once you have trained the model, you can use it by specifying the directory that contains the generated model files:


    % ./stepp -m ./models < samples/test.txt > tmp

How to Evaluate the Tagger

You can evaluate tagging accuracy with the "stepp-eval" command.


    % ./stepp-eval samples/test.pos tmp samples/train.pos
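The training, tagging, and evaluation steps above can be chained into one script, sketched below using only the commands shown in this document. The variable names are hypothetical. The reading of the stepp-eval arguments (gold-standard file, tagger output, training corpus) is an assumption inferred from the file names in the example command, not something stated here.

```shell
#!/bin/sh
# Train, tag, and evaluate in one go (variable names are hypothetical).
MODEL_DIR=./models          # directory passed to stepp-learn and stepp via -m
TRAIN=samples/train.pos     # tagged training corpus
TEST_RAW=samples/test.txt   # one sentence per line, space-separated tokens
TEST_GOLD=samples/test.pos  # assumed: gold-standard tagging of TEST_RAW
OUT=tmp                     # tagger output

if [ -x ./stepp ] && [ -x ./stepp-learn ] && [ -x ./stepp-eval ]; then
    # Each step runs only if the previous one succeeded.
    ./stepp-learn -m "$MODEL_DIR" "$TRAIN" &&
    ./stepp -m "$MODEL_DIR" < "$TEST_RAW" > "$OUT" &&
    ./stepp-eval "$TEST_GOLD" "$OUT" "$TRAIN"
    STATUS=$?
else
    echo "STePP binaries not found; run make first" >&2
    STATUS=1
fi
```

STATUS is zero only when all three steps succeeded, which makes the script easy to call from a larger experiment loop.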

Tips

- The memory and time required for training vary with the tagset and the size of the corpus. Training on sections 0-18 of the WSJ corpus used 1.3GB of memory and took 8 hours on an AMD Opteron server.
- You can build compact models by using L1 regularization (a kind of feature selection). Try the -c option when building the models. Training will take much longer, though.
- Although the normal (CRF + ME) mode usually gives better accuracy than the fast (CRF-only) mode, there are some cases where the latter performs better. It may be worth trying the CRF-only mode, especially when the training data is small.
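One way to act on the last tip is to tag the same test set in both modes and evaluate each output, as in the sketch below. The output file names are hypothetical, and the argument order for stepp-eval simply follows the example command given earlier.

```shell
MODEL_DIR=./models   # model directory produced by stepp-learn

if [ -x ./stepp ] && [ -x ./stepp-eval ]; then
    # Normal mode (CRF + ME).
    ./stepp -m "$MODEL_DIR" < samples/test.txt > out_normal.txt
    # Fast mode (CRF-only), enabled with -f.
    ./stepp -f -m "$MODEL_DIR" < samples/test.txt > out_fast.txt

    # Evaluate both outputs against the same tagged test file.
    for out in out_normal.txt out_fast.txt; do
        echo "== $out =="
        ./stepp-eval samples/test.pos "$out" samples/train.pos
    done
else
    echo "STePP binaries not found; run make first" >&2
fi
```

The same comparison can be repeated with a model trained using -c to see how much accuracy, if any, the compact model gives up on your data.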

Developers

Yoshimasa Tsuruoka (University of Manchester)

Daisuke Okanohara (University of Tokyo)

This page is maintained by Yoshimasa Tsuruoka