Amis - A maximum entropy estimator for feature forests

June 8th, 2006
Kazuhiro Yoshida
Department of Computer Science, University of Tokyo
kyoshida AT is.s.u-tokyo.ac.jp

Introduction
Requirements
- Hardware
- Software
Installation and startup
- Installation
- Startup
Input/output
Using reference distributions
Internal specs
Misc.
References

Introduction

This software is a parameter estimator for maximum entropy models [1]. Given a set of events as training data, the program outputs parameters that optimize the likelihood of the training data. The software supports the following functions.

Parameter estimation algorithm for feature forests [2]
Parameter estimation algorithms limited-memory BFGS [4], GIS [5]
MAP estimation using Gaussian prior distributions
Estimation of the ME models with inequality constraints [8] by using the BLMVM [7] algorithm.
Compression of event files

Requirements

Hardware

CPU: PentiumIII 500MHz or above
Memory: 256 MB or above
Hard disk: 50MB or above

The program can be compiled and run on IA machines of the above specs, or SPARC machines of the equivalent specs. More memory/hard disk will be required depending on the size of input data.

Software

OS: Linux, Solaris
Compiler: g++ 3.2 or above
Library: Standard C++ Library

The developers tested the program using g++ 3.2.3 and g++ 4.0.3.

Installation and startup

Installation

Amis can be compiled and installed by the following procedure. (In the following examples, % represents a shell prompt.)

Run the "configure" script

% ./configure

By default, the program is installed in /usr/local/. If you want to install Amis in another directory (let $DIR be the name of the directory), specify the option as follows.

% ./configure --prefix=$DIR

"configure" accepts various options other than "--prefix".

Option	Default	Valid values	Effect
--enable-debug	no	0 - 5 or no	Specify whether debug messages are printed or not. The greater value is given, and more messages are printed.
--enable-profile	no	0 - 5 or no	Specify whether profiling (measuring the execution time of each function) is enabled or not. The greater value is specified, and more functions are profiled.

For other options, see the help message of the configure script printed by --help option. (Note that the options other than those listed above may not work correctly.)

Compile the program by the "make" command
```
% make
```
The executable "amis" is created in "./src/".
Install the executable and manuals.
```
% make install
```

Now you have "/usr/local/bin/amis" installed.

Startup

To start up Amis, execute "amis" with an argument specifying a configuration file (described later).

% amis [configuration file]

You can omit the argument, if you have a configuration file named "amis.conf". When the specified configuration file is not accessible, the program stops with an error. Amis also accepts several startup options.

Each line of a configuration file consists of the name of a property and the specification of its value. An example configuration file is shown below.

DATA_FORMAT     Amis
FEATURE_TYPE    integer
MODEL_FILE      example.model
EVENT_FILE      example.event
OUTPUT_FILE     example.output
LOG_FILE        example.log
ESTIMATION_ALGORITHM    BFGS
NUM_ITERATIONS  1000
REPORT_INTERVAL 1
PRECISION       6

Startup options of amis is specified as follows.

% amis -e foo.event -a BFGSMAP [configuration file]

Priority of the configuration is as below.

Startup options
Configuration file
Default values

You can specify the following items in a configuration file or as start-up options. Other items will be shown by "-h" or "--help" option.

Property name	Startup option	Default value	Valid values	Effect
BC_LOWER	--bc-lower	1.0	Real value greater than 0	Lower bounds of the box constraints used by BLMVMBC, BLMVMBCMAP. The value which is represented by B in[8] is set to the reciprocal of the specified number.
BC_UPPER	--bc-upper	1.0	Real value greater than 0	Upper bounds of the box constraints used by BLMVMBC, BLMVMBCMAP. The value which is represented by A in[8] is set to the reciprocal of the specified number.
DATA_FORMAT	--data-format, -d	Amis	Amis, AmisTree, AmisFix	Data format of input files.
ESTIMATION_ALGORITHM	--estimation-algorithm, -a	GIS	GIS, GISMAP, BFGS, BFGSMAP, BLMVMBC, BLMVMBCMAP	An algorithm used for parameter estimation.
EVENT_FILE [1] [2] ... [n]	--event-file, -e	amis.event	list of file names	The list of the names of input event files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once.
EVENT_FILE_COMPRESSION	--event-file-compression		raw, gz, bz2	The compression format of input event files.
EVENT_ON_FILE	--event-on-file	false	truth value	Put the event data on a file. (Used when the size of the event file is too large for the main memory.)
EVENT_ON_FILE_NAME	--event-on-file-name	amis.event.tmp	file name	The name of the file used by EVENT_ON_FILE.
FEATURE_TYPE	--feature-type, -f	binary	binary, integer, real	The range of the values of feature functions. Effects the speed of estimation.
FEATURE_WEIGHT_TYPE	--feature-weight-type, -w	alpha	lambda, alpha	Type of the parameter values in input/output files. (lambda's are the natural logs of alpha's)
FILTER_INACTIVE_FEATURES	--filter-inactive-features	false	truth value	Remove inactive features from output model files.
FIXMAP_FILE [1] [2] ... [n]	--fixmap-file, -x	amis.fixmap	list of file names	The list of the names of files, which specify the format of AmisFix event files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once.
FIXMAP_FILE_COMPRESSION	--fixmap-file-compression		raw, gz, bz2	The compression format of FIXMAP_FILE.
LOG_FILE	--log-file, -l	amis.log	file name	The log file name.
MAP_SIGMA	--map-sigma, -s	1	Real value greater than 0	The standard deviation of the prior distribution used by GISMAP, BFGSMAP, BLMVMBCMAP.
MEMORY_SIZE	--memory-size	5	integer	The number of vectors kept by BFGS and BLMVM.
MODEL_FILE [1] [2] ... [n]	--model-file, -m	amis.model	list of file names	The list of the names of input model files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once.
MODEL_FILE_COMPRESSION	--model-file-compression		raw, gz, bz2	The compression format of input model files.
NUM_ITERATIONS	--num-iterations, -i	200	integer	Number of iterations.
OUTPUT_FILE	--output-file, -o	amis.output	file name	The name of the output model file.
OUTPUT_FILE_COMPRESSION	--output-file-compression		raw, gz, bz2	The compression format of output model files.
PARAMETER_TYPE	--parameter-type	alpha	alpha, lambda	The type of parameters used for internal computation. alpha is faster, but lambda is more robust.
PRECISION	--precision, -p	6	integer	The number of significant digits of floating point numbers.
REFERENCE_DISTRIBUTION	--reference	false	truth values	Use reference distributions.
REFERENCE_FILE [1] [2] ... [n]	--reference-file	amis.ref	list of file names	The list of the names of the reference probability files.
REPORT_INTERVAL	--report-interval, -r	1	integer	Interval of logging.

Input/output

You need at least two kinds of files other than the configuration file: the model file and the event file. Their formats are described below.

For each file, # to the end of line is a comment and ignored. Comments are treated as a space. Each token is separated by spaces or tabs, and "new line" represents the end of line. Colons (:) are a special character. When you want to use these special characters as a part of tokens, escape the character with a backslash (\). A backslash itself is represented as \\.

Input model files

A model file gives the names of feature functions and corresponding initial parameters. See the following example.

[feature name]    [initial value]
[feature name]    [initial value]
[feature name]    [initial value]
...

Each line corresponds to one feature. First, you specify the name of a feature. For feature names, you can use any characters except for spaces, tabs, colons (:), and pounds (#). Next, following spaces or tabs, specify the initial parameter of the feature. Initial values are given by C-style floating point values. Initial parameters can be any positive number (if FEATURE_WEIGHT_TYPE is lambda, it can be any real number) Usually, all the initial values are set to 1.0 (or 0.0, if FEATURE_WEIGHT_TYPE is lambda.)

Input event files

An event file gives a list of events used for training the model. An event in maximum entropy models consists of an observed event and its complement events. Complement events are a list of alternative events that could have been observed instead of an observed event. Each event is regarded as a process of selecting an observed event from a set of events consisting of observed and complement events. In Amis event files, both the observed and complement events are represented by enumerating the activated features of each event.

You have three choices of format of event files: Amis, AmisFix, and AmisTree formats. AmisFix format is used when the task is to select a label from a fixed number of labels. AmisTree format is used when events can be represented as feature forests.

Though Amis format and AmisTree format have, theoretically, equivalent expressive power, AmisTree format can be considerably efficient both in terms of time and space, when the problem allows compact description as feature forests. AmisFix format has restricted expressive power, but in case of simple classification problems, it can work with smaller space requirements. (Though it is possible that Amis format works better in terms of estimation time.)

Amis format

event_1
1    [feature] [feature] [feature] ...
0    [feature] [feature] [feature] ...
0    [feature] [feature] [feature] ...
0    [feature] [feature] [feature] ...
...

event_2
0    [feature] [feature] [feature] ...
1    [feature] [feature] [feature] ...
0    [feature] [feature] [feature] ...
...

...

As [feature], you can write

[feature name]

[feature name]:[feature value]

if FEATURE_TYPE is integer or real.

Each block separated by blank lines corresponds to one event. In the first line, you specify the name of an event. You can use any characters except for special characters mentioned above. (Event names can be arbitrary strings, because they don't affect the results of estimation.) Succeeding lines represent an observed event or complement events. At the beginning of a line, specify the number of times the event observed. For an observed event, it should be positive integer, and for a complement event it should be zero. Only one observed event is permitted for each event. After the number of observation, enumerate the names of activated features for an event. Each feature must be defined in a model file. If you specified a feature not found in a model file, it would be an error. The value of a feature function can be specified following the feature name. As in the above example, the feature value is specified following a colon (:). When omitted, it will be 1.

Each event description is separated by blank lines. Note that a line with only comments is also treated as a blank line.

AmisFix format

First, you must prepare a file specified as FIXMAP_FILE, which is used to map properties specified in event files to feature names. FIXMAP_FILE looks as follows.

[label name] [label name] [label name] ...
[property name] [label name] [feature name]  [label name] [feature name] ...
[property name] [label name] [feature name]  [label name] [feature name] ...
[property name] [label name] [feature name]  [label name] [feature name] ...
...

The format of an event file is as follows.

[label name] 1 [property] [property] [property] ...
[label name] 1 [property] [property] [property] ...
[label name] 1 [property] [property] [property] ...
...

As [property], you can write

[property name]

[property name]:[property value]

if FEATURE_TYPE is integer or real.

The first line of FIXMAP file enumerates the set of labels used for classification. Each of the other lines is a definition of a property, and specifies the name of the feature that will become active when the property and the label co-occur.

Each line of an event file corresponds to an observed event. The line starts with the name of an observed label and the frequency of observation. The rest of the line enumerates the observed properties. Amis generates an observed event automatically, by mapping the pairs of the observed label and enumerated properties to the features specified in FIXMAP_FILE. The complement events are generated the same way, using the labels other than the observed label.

AmisTree format

[event name]
[frequency]       [feature] [feature] [feature] ...
[disjunctive node]

BNF-like representation of [disjunctive node] is as follows:

[disjunctive node] :=
  [reference to some node name] |
  { [node name] [conjunctive node] [conjunctive node] ... }
[disjunctive node] :=
  [reference to some node name] |
  ( [node name] [feature] [feature] ... [disjunctive node] [disjunctive node] ... )

You must put spaces before and after curly braces and round brackets, so that they are not treated as a part of node names or feature names.

A real example looks like this:

event_1
2       feature1:2 feature2:3 feature3
{ dnode_1 ( node_1 feature1:2 { dnode_2 ( node_2 feature2:3 ) ( node_3 ) } { dnode_3 $node_2 ( node_4 feature3 ) } ) }

event_2
1       feature2:3
{ dnode_1 ( node_1 feature1 ) ( node_2 { dnode_2 ( node_3 feature2:3 ) ( node_4 feature3 ) } ) }

...

As in the Amis format, a blank line separates each event description. In the AmisTree format, an event is represented with three lines. The first line specifies the name of an event. In the second line, you show the number of times of the observed event and enumerate activated features of an observed event. In the above example, event_1 is observed twice, and event_2 once. As in the Amis format, specify the name of a feature together with its value. The third line represents an observed event and complement events in a feature forest. Disjunctive nodes are represented with curly braces. Between the curly braces, the name of a node is first specified, and conjunctive nodes follow. Node names must be unique in each event, but you can use '_' for the nodes which are never refered to. Conjunctive nodes are represented with round brackets. Between the round brackets, the name of a node is first specified, and activated features follow. Feature descriptions are the same as Amis format. You can also specify disjunctive nodes as daughter nodes. Node names are used to represent structure-sharing. Already appeared nodes can be refered by "$" followed by the node name. In event_1 in the above example, $node2 represents the sharing with node2 already appeared. By using node sharing and pack the feature forest smaller, the computational complexity reduces, and the computation will be accelerated. You can use any characters except for special characters for node names and feature names.

From the feature forests, we can extract a set of feature lists by the following mutual-recursive procedures.

When the node is [disjunctive node], return the feature list which is obtained from one of its daughters.
When the node is [conjunctive node], return the union of the feature list of the node and the feature lists obtained from all of its daughters.

An event written in AmisTree format can be converted into an Amis format event which have all the feature lists that can be extracted from the feature forest as its observed and complement events. The result of parameter estimation for AmisTree format events is equivalent to that for such an Amis format events, but the computational cost of the estimation is proportional to the size of the feature forests.

Output

Amis outputs pairs of feature and parameter. The output format is the same as a model file. Since the output format is the same as a model file, the output file can be reused as an input of the new computation. That is, we can further progress the parameter estimation given already estimated data.

Parameters a_i corresponding to feature functions are output, and we can compute the probability of an unknown event by the product of a_i for all activated features.

The tractable size of events and features depends on the size of system resources (memory and hard disk).
When the memory is insufficient, try EVENT_ON_FILE option.
If you see "Infinite!" messages in GIS or IIS algorithms, or BFGS does not converge (you will see "Line search in BFGS algorithm ..."), parameter estimation seems to cause overflow or underflow. In such cases, specify "--parameter-type lambda" as a startup option.
In feature forest models, a feature vector specified in a line of an observed event must be included in a feature forest. "amis" does not check this condition. If this is not satisfied, parameter estimation may not converge. This condition can be checked by the tool "forestcheck".