Amis - A maximum entropy estimator for feature forests

June 8th, 2006
Kazuhiro Yoshida
Department of Computer Science, University of Tokyo
kyoshida AT is.s.u-tokyo.ac.jp
Japanese version

Contents


Introduction

This software is a parameter estimator for maximum entropy models [1]. Given a set of events as training data, the program outputs parameters that optimize the likelihood of the training data. The software supports the following functions.


Requirements

Hardware

CPU
PentiumIII 500MHz or above
Memory
256 MB or above
Hard disk
50MB or above

The program can be compiled and run on IA machines of the above specs, or SPARC machines of the equivalent specs. More memory/hard disk will be required depending on the size of input data.

Software

OS
Linux, Solaris
Compiler
g++ 3.2 or above
Library
Standard C++ Library

The developers tested the program using g++ 3.2.3 and g++ 4.0.3.


Installation and startup

Installation

Amis can be compiled and installed by the following procedure. (In the following examples, % represents a shell prompt.)

  1. Run the "configure" script
    % ./configure
    By default, the program is installed in /usr/local/. If you want to install Amis in another directory (let $DIR be the name of the directory), specify the option as follows.
    % ./configure --prefix=$DIR
    "configure" accepts various options other than "--prefix".
    OptionDefaultValid valuesEffect
    --enable-debugno0 - 5 or no Specify whether debug messages are printed or not. The greater value is given, and more messages are printed.
    --enable-profileno0 - 5 or no Specify whether profiling (measuring the execution time of each function) is enabled or not. The greater value is specified, and more functions are profiled.
    For other options, see the help message of the configure script printed by --help option. (Note that the options other than those listed above may not work correctly.)
  2. Compile the program by the "make" command
    % make
    The executable "amis" is created in "./src/".
  3. Install the executable and manuals.
    % make install

Now you have "/usr/local/bin/amis" installed.

Startup

To start up Amis, execute "amis" with an argument specifying a configuration file (described later).

% amis [configuration file]
You can omit the argument, if you have a configuration file named "amis.conf". When the specified configuration file is not accessible, the program stops with an error. Amis also accepts several startup options.

Each line of a configuration file consists of the name of a property and the specification of its value. An example configuration file is shown below.

DATA_FORMAT     Amis
FEATURE_TYPE    integer
MODEL_FILE      example.model
EVENT_FILE      example.event
OUTPUT_FILE     example.output
LOG_FILE        example.log
ESTIMATION_ALGORITHM    BFGS
NUM_ITERATIONS  1000
REPORT_INTERVAL 1
PRECISION       6

Startup options of amis is specified as follows.

% amis -e foo.event -a BFGSMAP [configuration file]

Priority of the configuration is as below.

  1. Startup options
  2. Configuration file
  3. Default values

You can specify the following items in a configuration file or as start-up options. Other items will be shown by "-h" or "--help" option.

Property nameStartup optionDefault valueValid valuesEffect
BC_LOWER--bc-lower1.0Real value greater than 0 Lower bounds of the box constraints used by BLMVMBC, BLMVMBCMAP. The value which is represented by B in[8] is set to the reciprocal of the specified number.
BC_UPPER--bc-upper1.0Real value greater than 0 Upper bounds of the box constraints used by BLMVMBC, BLMVMBCMAP. The value which is represented by A in[8] is set to the reciprocal of the specified number.
DATA_FORMAT--data-format, -dAmisAmis, AmisTree, AmisFix Data format of input files.
ESTIMATION_ALGORITHM--estimation-algorithm, -aGISGIS, GISMAP, BFGS, BFGSMAP, BLMVMBC, BLMVMBCMAP An algorithm used for parameter estimation.
EVENT_FILE [1] [2] ... [n]--event-file, -e amis.eventlist of file names The list of the names of input event files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once.
EVENT_FILE_COMPRESSION--event-file-compressionraw, gz, bz2 The compression format of input event files.
EVENT_ON_FILE--event-on-filefalsetruth value Put the event data on a file. (Used when the size of the event file is too large for the main memory.)
EVENT_ON_FILE_NAME--event-on-file-nameamis.event.tmpfile name The name of the file used by EVENT_ON_FILE.
FEATURE_TYPE--feature-type, -fbinarybinary, integer, real The range of the values of feature functions. Effects the speed of estimation.
FEATURE_WEIGHT_TYPE--feature-weight-type, -walphalambda, alpha Type of the parameter values in input/output files. (lambda's are the natural logs of alpha's)
FILTER_INACTIVE_FEATURES--filter-inactive-featuresfalsetruth value Remove inactive features from output model files.
FIXMAP_FILE [1] [2] ... [n]--fixmap-file, -x amis.fixmaplist of file names The list of the names of files, which specify the format of AmisFix event files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once.
FIXMAP_FILE_COMPRESSION--fixmap-file-compressionraw, gz, bz2 The compression format of FIXMAP_FILE.
LOG_FILE--log-file, -lamis.logfile name The log file name.
MAP_SIGMA--map-sigma, -s1Real value greater than 0 The standard deviation of the prior distribution used by GISMAP, BFGSMAP, BLMVMBCMAP.
MEMORY_SIZE--memory-size5integer The number of vectors kept by BFGS and BLMVM.
MODEL_FILE [1] [2] ... [n]--model-file, -mamis.modellist of file names The list of the names of input model files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once.
MODEL_FILE_COMPRESSION--model-file-compressionraw, gz, bz2 The compression format of input model files.
NUM_ITERATIONS--num-iterations, -i200integer Number of iterations.
OUTPUT_FILE--output-file, -oamis.outputfile name The name of the output model file.
OUTPUT_FILE_COMPRESSION--output-file-compressionraw, gz, bz2 The compression format of output model files.
PARAMETER_TYPE--parameter-typealphaalpha, lambda The type of parameters used for internal computation. alpha is faster, but lambda is more robust.
PRECISION--precision, -p6integer The number of significant digits of floating point numbers.
REFERENCE_DISTRIBUTION--referencefalsetruth values Use reference distributions.
REFERENCE_FILE [1] [2] ... [n]--reference-fileamis.reflist of file names The list of the names of the reference probability files.
REPORT_INTERVAL--report-interval, -r1integer Interval of logging.


Input/output

You need at least two kinds of files other than the configuration file: the model file and the event file. Their formats are described below.

For each file, # to the end of line is a comment and ignored. Comments are treated as a space. Each token is separated by spaces or tabs, and "new line" represents the end of line. Colons (:) are a special character. When you want to use these special characters as a part of tokens, escape the character with a backslash (\). A backslash itself is represented as \\.

Input model files

A model file gives the names of feature functions and corresponding initial parameters. See the following example.

[feature name]    [initial value]
[feature name]    [initial value]
[feature name]    [initial value]
...

Each line corresponds to one feature. First, you specify the name of a feature. For feature names, you can use any characters except for spaces, tabs, colons (:), and pounds (#). Next, following spaces or tabs, specify the initial parameter of the feature. Initial values are given by C-style floating point values. Initial parameters can be any positive number (if FEATURE_WEIGHT_TYPE is lambda, it can be any real number) Usually, all the initial values are set to 1.0 (or 0.0, if FEATURE_WEIGHT_TYPE is lambda.)

Input event files

An event file gives a list of events used for training the model. An event in maximum entropy models consists of an observed event and its complement events. Complement events are a list of alternative events that could have been observed instead of an observed event. Each event is regarded as a process of selecting an observed event from a set of events consisting of observed and complement events. In Amis event files, both the observed and complement events are represented by enumerating the activated features of each event.

You have three choices of format of event files: Amis, AmisFix, and AmisTree formats. AmisFix format is used when the task is to select a label from a fixed number of labels. AmisTree format is used when events can be represented as feature forests.

Though Amis format and AmisTree format have, theoretically, equivalent expressive power, AmisTree format can be considerably efficient both in terms of time and space, when the problem allows compact description as feature forests. AmisFix format has restricted expressive power, but in case of simple classification problems, it can work with smaller space requirements. (Though it is possible that Amis format works better in terms of estimation time.)

Amis format

event_1
1    [feature] [feature] [feature] ...
0    [feature] [feature] [feature] ...
0    [feature] [feature] [feature] ...
0    [feature] [feature] [feature] ...
...

event_2
0    [feature] [feature] [feature] ...
1    [feature] [feature] [feature] ...
0    [feature] [feature] [feature] ...
...

...
As [feature], you can write
[feature name]
or
[feature name]:[feature value]
if FEATURE_TYPE is integer or real.

Each block separated by blank lines corresponds to one event. In the first line, you specify the name of an event. You can use any characters except for special characters mentioned above. (Event names can be arbitrary strings, because they don't affect the results of estimation.) Succeeding lines represent an observed event or complement events. At the beginning of a line, specify the number of times the event observed. For an observed event, it should be positive integer, and for a complement event it should be zero. Only one observed event is permitted for each event. After the number of observation, enumerate the names of activated features for an event. Each feature must be defined in a model file. If you specified a feature not found in a model file, it would be an error. The value of a feature function can be specified following the feature name. As in the above example, the feature value is specified following a colon (:). When omitted, it will be 1.

Each event description is separated by blank lines. Note that a line with only comments is also treated as a blank line.

AmisFix format

First, you must prepare a file specified as FIXMAP_FILE, which is used to map properties specified in event files to feature names. FIXMAP_FILE looks as follows.

[label name] [label name] [label name] ...
[property name] [label name] [feature name]  [label name] [feature name] ...
[property name] [label name] [feature name]  [label name] [feature name] ...
[property name] [label name] [feature name]  [label name] [feature name] ...
...

The format of an event file is as follows.

[label name] 1 [property] [property] [property] ...
[label name] 1 [property] [property] [property] ...
[label name] 1 [property] [property] [property] ...
...
As [property], you can write
[property name]
or
[property name]:[property value]
if FEATURE_TYPE is integer or real.

The first line of FIXMAP file enumerates the set of labels used for classification. Each of the other lines is a definition of a property, and specifies the name of the feature that will become active when the property and the label co-occur.

Each line of an event file corresponds to an observed event. The line starts with the name of an observed label and the frequency of observation. The rest of the line enumerates the observed properties. Amis generates an observed event automatically, by mapping the pairs of the observed label and enumerated properties to the features specified in FIXMAP_FILE. The complement events are generated the same way, using the labels other than the observed label.

AmisTree format

[event name]
[frequency]       [feature] [feature] [feature] ...
[disjunctive node]
BNF-like representation of [disjunctive node] is as follows:
[disjunctive node] :=
  [reference to some node name] |
  { [node name] [conjunctive node] [conjunctive node] ... }
[disjunctive node] :=
  [reference to some node name] |
  ( [node name] [feature] [feature] ... [disjunctive node] [disjunctive node] ... )
You must put spaces before and after curly braces and round brackets, so that they are not treated as a part of node names or feature names.

A real example looks like this:

event_1
2       feature1:2 feature2:3 feature3
{ dnode_1 ( node_1 feature1:2 { dnode_2 ( node_2 feature2:3 ) ( node_3 ) } { dnode_3 $node_2 ( node_4 feature3 ) } ) }

event_2
1       feature2:3
{ dnode_1 ( node_1 feature1 ) ( node_2 { dnode_2 ( node_3 feature2:3 ) ( node_4 feature3 ) } ) }

...

As in the Amis format, a blank line separates each event description. In the AmisTree format, an event is represented with three lines. The first line specifies the name of an event. In the second line, you show the number of times of the observed event and enumerate activated features of an observed event. In the above example, event_1 is observed twice, and event_2 once. As in the Amis format, specify the name of a feature together with its value. The third line represents an observed event and complement events in a feature forest. Disjunctive nodes are represented with curly braces. Between the curly braces, the name of a node is first specified, and conjunctive nodes follow. Node names must be unique in each event, but you can use '_' for the nodes which are never refered to. Conjunctive nodes are represented with round brackets. Between the round brackets, the name of a node is first specified, and activated features follow. Feature descriptions are the same as Amis format. You can also specify disjunctive nodes as daughter nodes. Node names are used to represent structure-sharing. Already appeared nodes can be refered by "$" followed by the node name. In event_1 in the above example, $node2 represents the sharing with node2 already appeared. By using node sharing and pack the feature forest smaller, the computational complexity reduces, and the computation will be accelerated. You can use any characters except for special characters for node names and feature names.

From the feature forests, we can extract a set of feature lists by the following mutual-recursive procedures.

An event written in AmisTree format can be converted into an Amis format event which have all the feature lists that can be extracted from the feature forest as its observed and complement events. The result of parameter estimation for AmisTree format events is equivalent to that for such an Amis format events, but the computational cost of the estimation is proportional to the size of the feature forests.

Output

Amis outputs pairs of feature and parameter. The output format is the same as a model file. Since the output format is the same as a model file, the output file can be reused as an input of the new computation. That is, we can further progress the parameter estimation given already estimated data.

Parameters a_i corresponding to feature functions are output, and we can compute the probability of an unknown event by the product of a_i for all activated features.


Using reference distributions

Internal specs

Coming soon...

Misc. (restrictions and known bugs)


References

[1] Adam L. Berger, Stephen A. Della Pietra and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.

[2] Yusuke Miyao and Jun'ichi Tsujii. Maximum entropy estimation for feature forests. In Proc. HLT2002.

[3] Stephen A. Della Pietra, Vincent J. Della Pietra and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.

[4] Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35:773-783, 1980.

[5] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43:1470-1480, 1972.

[6] Stanley F. Chen and Ronald Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical Report CMUCS -99-108, Carnegie Mellon University, 1999.

[7] Steven J. Benson and Jorge More'. A Limited-Memory Variable-Metric Algorithm for Bound-Constrained Minimization. Mathematics and Computer Science Division, Argonne National Laboratory, ANL/MCS-P909-0901, 2001.

[8] Jun'ichi Kazama and Jun'ichi Tsujii. Evaluation and Extension of Maximum Entropy Models with Inequality Constraints. In the Proceedings of EMNLP 2003.