A part-of-speech tagger for English

Developed at:
University of Tokyo, Department of Computer Science,
Tsujii laboratory

Version 1.0

Overview

Tagging speed is crucial in large-scale information extraction and real-time NLP applications. This part-of-speech (POS) tagger offers fast tagging (2400 tokens/sec) with state-of-the-art accuracy (97.10% on the WSJ corpus). The tagger uses an extension of Maximum Entropy Markov Models (MEMMs), in which tags are determined in an easiest-first manner. For details of the algorithm and performance, see [1].

Note: This page is no longer maintained. A more accurate and trainable version of the tagger is available.

How to use the tagger

The tagger has been tested only on Linux with gcc.

1. Download the latest version of the tagger

Jul. 11 2005 postagger-1.0.zip (binary for Windows)
Jul. 8 2005 postagger-1.0.tar.gz (sources for Unix)

2. Expand the archive


> tar xvzf postagger-1.0.tar.gz

3. Make


> cd postagger/ 

> make

4. Tag sentences

Prepare a text file containing one sentence per line, then run:


> ./tagger < TEXTFILE > TAGGEDTEXT
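
As a rough sketch of what "one sentence per line" can look like in practice, the helper below strips Windows line endings and drops empty lines from a text stream before it is handed to the tagger. The name clean_lines is made up for illustration and is not part of the tagger distribution; it does not split paragraphs into sentences, it only assumes the input already has line breaks at sentence boundaries.

```shell
# clean_lines: hypothetical helper, not shipped with the tagger.
# Reads text on stdin, removes carriage returns, and drops empty or
# whitespace-only lines so that every remaining line holds one sentence.
clean_lines() {
  tr -d '\r' | sed -e '/^[[:space:]]*$/d'
}

# Demo on inline sample text (CRLF line endings plus one empty line).
printf '%s\r\n' 'He opened the window.' '' 'She closed the door.' | clean_lines
```

The cleaned stream can be redirected to a file and used as TEXTFILE in the command above, or piped straight into ./tagger.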

Example

> echo "He opened the window." | ./tagger
He/PRP opened/VBD the/DT window/NN ./.
>
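
The output format shown above (space-separated word/TAG pairs, one sentence per line) is easy to post-process with standard tools. The following is a minimal sketch, assuming that format holds for all output; split_tags is a hypothetical helper name, not something provided by the tagger. It splits each token at its last slash, so a token such as ./. is handled correctly.

```shell
# split_tags: hypothetical helper, not shipped with the tagger.
# Converts lines of "word/TAG word/TAG ..." on stdin into one
# "word<TAB>TAG" pair per line, splitting each token at its LAST slash.
split_tags() {
  awk '{
    for (i = 1; i <= NF; i++) {
      n = match($i, /\/[^\/]*$/)   # position of the last slash in the token
      if (n > 0)
        printf "%s\t%s\n", substr($i, 1, n - 1), substr($i, n + 1)
    }
  }'
}

# Demo on the sample output from the example above.
printf '%s\n' 'He/PRP opened/VBD the/DT window/NN ./.' | split_tags
```

With the tagger built, the same helper can be appended to the pipeline, for instance: echo "He opened the window." | ./tagger | split_tags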

References

[1] Yoshimasa Tsuruoka and Jun'ichi Tsujii, Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data, Proceedings of HLT/EMNLP 2005, pp. 467-474.

This page is maintained by Yoshimasa Tsuruoka