This software is a parameter estimator for maximum entropy models [1]. Given a set of events as training data, the program outputs parameters that optimize the likelihood of the training data. The software supports the following functions.
The program can be compiled and run on IA machines of the above specs, or SPARC machines of the equivalent specs. More memory/hard disk will be required depending on the size of input data.
The developers tested the program using g++ 3.2.3 and g++ 4.0.3.
Amis can be compiled and installed by the following procedure. (In the following examples, % represents a shell prompt.)
% ./configure
By default, the program is installed in /usr/local/. If you want to install Amis in another directory (let $DIR be the name of the directory), specify the option as follows.
% ./configure --prefix=$DIR
"configure" accepts various options other than "--prefix".
Option | Default | Valid values | Effect |
---|---|---|---|
--enable-debug | no | 0 - 5 or no | Specify whether debug messages are printed. The greater the value, the more messages are printed. |
--enable-profile | no | 0 - 5 or no | Specify whether profiling (measuring the execution time of each function) is enabled. The greater the value, the more functions are profiled. |
% make
The executable "amis" is created in "./src/".
% make install
Now you have "/usr/local/bin/amis" installed.
To start up Amis, execute "amis" with an argument specifying a configuration file (described later).
% amis [configuration file]
You can omit the argument, if you have a configuration file named "amis.conf". When the specified configuration file is not accessible, the program stops with an error. Amis also accepts several startup options.
Each line of a configuration file consists of the name of a property and the specification of its value. An example configuration file is shown below.
DATA_FORMAT Amis
FEATURE_TYPE integer
MODEL_FILE example.model
EVENT_FILE example.event
OUTPUT_FILE example.output
LOG_FILE example.log
ESTIMATION_ALGORITHM BFGS
NUM_ITERATIONS 1000
REPORT_INTERVAL 1
PRECISION 6
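The "one property per line" layout can be read with a few lines of Python. The following is a minimal sketch, not part of Amis: `parse_config` is a hypothetical helper, and it assumes that a property name and its values are separated by whitespace and that values contain no spaces.

```python
def parse_config(text):
    """Map each property name to the list of values that follow it on its line."""
    props = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip empty lines
        name, *values = tokens
        props[name] = values  # a list, because some properties take several file names
    return props


example = """\
DATA_FORMAT Amis
FEATURE_TYPE integer
MODEL_FILE example.model
EVENT_FILE example.event
ESTIMATION_ALGORITHM BFGS
NUM_ITERATIONS 1000
"""

config = parse_config(example)
print(config["ESTIMATION_ALGORITHM"], config["NUM_ITERATIONS"])
```

Values are kept as lists of strings because properties such as EVENT_FILE accept more than one file name.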
Startup options of Amis are specified as follows.
% amis -e foo.event -a BFGSMAP [configuration file]
Priority of the configuration is as below.
You can specify the following items in a configuration file or as start-up options. Other items will be shown by "-h" or "--help" option.
Property name | Startup option | Default value | Valid values | Effect |
---|---|---|---|---|
BC_LOWER | --bc-lower | 1.0 | Real value greater than 0 | Lower bounds of the box constraints used by BLMVMBC and BLMVMBCMAP. The value represented by B in [8] is set to the reciprocal of the specified number. |
BC_UPPER | --bc-upper | 1.0 | Real value greater than 0 | Upper bounds of the box constraints used by BLMVMBC and BLMVMBCMAP. The value represented by A in [8] is set to the reciprocal of the specified number. |
DATA_FORMAT | --data-format, -d | Amis | Amis, AmisTree, AmisFix | Data format of input files. |
ESTIMATION_ALGORITHM | --estimation-algorithm, -a | GIS | GIS, GISMAP, BFGS, BFGSMAP, BLMVMBC, BLMVMBCMAP | An algorithm used for parameter estimation. |
EVENT_FILE [1] [2] ... [n] | --event-file, -e | amis.event | list of file names | The list of the names of input event files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once. |
EVENT_FILE_COMPRESSION | --event-file-compression | | raw, gz, bz2 | The compression format of input event files. |
EVENT_ON_FILE | --event-on-file | false | truth value | Put the event data on a file. (Used when the size of the event file is too large for the main memory.) |
EVENT_ON_FILE_NAME | --event-on-file-name | amis.event.tmp | file name | The name of the file used by EVENT_ON_FILE. |
FEATURE_TYPE | --feature-type, -f | binary | binary, integer, real | The range of the values of feature functions. Affects the speed of estimation. |
FEATURE_WEIGHT_TYPE | --feature-weight-type, -w | alpha | lambda, alpha | Type of the parameter values in input/output files. (lambda's are the natural logs of alpha's) |
FILTER_INACTIVE_FEATURES | --filter-inactive-features | false | truth value | Remove inactive features from output model files. |
FIXMAP_FILE [1] [2] ... [n] | --fixmap-file, -x | amis.fixmap | list of file names | The list of the names of files, which specify the format of AmisFix event files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once. |
FIXMAP_FILE_COMPRESSION | --fixmap-file-compression | | raw, gz, bz2 | The compression format of FIXMAP_FILE. |
LOG_FILE | --log-file, -l | amis.log | file name | The log file name. |
MAP_SIGMA | --map-sigma, -s | 1 | Real value greater than 0 | The standard deviation of the prior distribution used by GISMAP, BFGSMAP, BLMVMBCMAP. |
MEMORY_SIZE | --memory-size | 5 | integer | The number of vectors kept by BFGS and BLMVM. |
MODEL_FILE [1] [2] ... [n] | --model-file, -m | amis.model | list of file names | The list of the names of input model files. When the file name is '-', the corresponding input is read from standard input. This startup option can appear more than once. |
MODEL_FILE_COMPRESSION | --model-file-compression | | raw, gz, bz2 | The compression format of input model files. |
NUM_ITERATIONS | --num-iterations, -i | 200 | integer | Number of iterations. |
OUTPUT_FILE | --output-file, -o | amis.output | file name | The name of the output model file. |
OUTPUT_FILE_COMPRESSION | --output-file-compression | | raw, gz, bz2 | The compression format of output model files. |
PARAMETER_TYPE | --parameter-type | alpha | alpha, lambda | The type of parameters used for internal computation. alpha is faster, but lambda is more robust. |
PRECISION | --precision, -p | 6 | integer | The number of significant digits of floating point numbers. |
REFERENCE_DISTRIBUTION | --reference | false | truth value | Use reference distributions. |
REFERENCE_FILE [1] [2] ... [n] | --reference-file | amis.ref | list of file names | The list of the names of the reference probability files. |
REPORT_INTERVAL | --report-interval, -r | 1 | integer | Interval of logging. |
You need at least two kinds of files other than the configuration file: the model file and the event file. Their formats are described below.
In every file, the text from # to the end of the line is a comment; it is ignored and treated as a space. Tokens are separated by spaces or tabs, and a newline marks the end of a line. The colon (:) is also a special character. To use any of these special characters as part of a token, escape it with a backslash (\). A backslash itself is written as \\.
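The escaping rule can be illustrated with a small Python sketch. `escape_token` and `unescape_token` are hypothetical helpers, not part of Amis, and the exact set of characters that need escaping (backslash, colon, #, space, tab) is an assumption based on the description above.

```python
# Assumed set of special characters: backslash, colon, '#', space, tab.
SPECIAL = set("\\:# \t")


def escape_token(token):
    """Put a backslash in front of every special character in a token."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in token)


def unescape_token(text):
    """Undo escape_token: drop each escaping backslash and keep the next character."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            ch = next(chars, "\\")  # a lone trailing backslash is kept as is
        out.append(ch)
    return "".join(out)


print(escape_token("time:12 30"))
```

A feature name generated from raw text, such as a word containing a colon, can be passed through `escape_token` before it is written to a model or event file.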
A model file gives the names of feature functions and corresponding initial parameters. See the following example.
[feature name] [initial value]
[feature name] [initial value]
[feature name] [initial value]
...
Each line corresponds to one feature. First, specify the name of the feature. Feature names can contain any characters except spaces, tabs, colons (:), and pound signs (#). Next, after spaces or tabs, specify the initial parameter of the feature. Initial values are given as C-style floating point values. An initial parameter can be any positive number (if FEATURE_WEIGHT_TYPE is lambda, it can be any real number). Usually, all the initial values are set to 1.0 (or 0.0, if FEATURE_WEIGHT_TYPE is lambda).
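As an illustration, a model file with the usual initial values can be generated and read back with a short Python script. This is a sketch under stated assumptions: `format_model`, `parse_model`, and `alpha_to_lambda` are hypothetical helpers, not part of Amis, and the parser assumes that feature names contain no escaped characters.

```python
import math


def format_model(feature_names, weight_type="alpha"):
    """Return model-file text with the usual initial value for every feature."""
    initial = 1.0 if weight_type == "alpha" else 0.0
    return "".join(f"{name}\t{initial}\n" for name in feature_names)


def parse_model(text):
    """Read 'name value' lines into a dict, ignoring comments and empty lines."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop a trailing comment
        if not line:
            continue
        name, value = line.split()
        params[name] = float(value)
    return params


def alpha_to_lambda(params):
    """Convert alpha weights to lambda weights (lambda = ln alpha)."""
    return {name: math.log(alpha) for name, alpha in params.items()}


model_text = format_model(["f1", "f2", "f3"])
print(model_text, end="")
print(alpha_to_lambda(parse_model(model_text)))
```

Because the output file uses the same layout, `parse_model` can also be applied to an output model file that contains no escaped characters.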
An event file gives a list of events used for training the model. An event in maximum entropy models consists of an observed event and its complement events. Complement events are a list of alternative events that could have been observed instead of an observed event. Each event is regarded as a process of selecting an observed event from a set of events consisting of observed and complement events. In Amis event files, both the observed and complement events are represented by enumerating the activated features of each event.
You have three choices of format of event files: Amis, AmisFix, and AmisTree formats. AmisFix format is used when the task is to select a label from a fixed number of labels. AmisTree format is used when events can be represented as feature forests.
Though the Amis format and the AmisTree format have, theoretically, equivalent expressive power, the AmisTree format can be considerably more efficient in terms of both time and space when the problem allows a compact description as feature forests. The AmisFix format has restricted expressive power, but for simple classification problems it can work with smaller space requirements. (It is possible, though, that the Amis format works better in terms of estimation time.)
event_1
1 [feature] [feature] [feature] ...
0 [feature] [feature] [feature] ...
0 [feature] [feature] [feature] ...
0 [feature] [feature] [feature] ...
...

event_2
0 [feature] [feature] [feature] ...
1 [feature] [feature] [feature] ...
0 [feature] [feature] [feature] ...
...

...
As [feature], you can write
[feature name]
or
[feature name]:[feature value]
if FEATURE_TYPE is integer or real.
Each block separated by blank lines corresponds to one event. The first line specifies the name of the event. You can use any characters except the special characters mentioned above. (Event names can be arbitrary strings, because they do not affect the results of estimation.) The succeeding lines represent an observed event or complement events. At the beginning of each line, specify the number of times the event was observed. For an observed event, it must be a positive integer, and for a complement event it must be zero. Only one observed event is permitted for each event. After the number of observations, enumerate the names of the activated features of the event. Each feature must be defined in a model file; specifying a feature that is not found in a model file is an error. The value of a feature function can be specified following the feature name, separated by a colon (:), as in the above example. When omitted, the value is 1.
Each event description is separated by blank lines. Note that a line with only comments is also treated as a blank line.
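The block structure described above can be read with a small Python parser. This is a minimal sketch, not the parser used by Amis: `parse_feature` and `parse_amis_events` are hypothetical names, and escaped special characters are not handled.

```python
def parse_feature(token):
    """Split 'name' or 'name:value' into (name, value); the value defaults to 1."""
    name, sep, value = token.partition(":")
    return name, (float(value) if sep else 1.0)


def parse_amis_events(text):
    """Return a list of (event name, [(count, [(feature, value), ...]), ...])."""
    events, block = [], []
    for raw in text.splitlines() + [""]:  # the extra "" flushes the last block
        line = raw.split("#", 1)[0].strip()  # comment-only lines become blank
        if line:
            block.append(line.split())
            continue
        if block:
            name = block[0][0]
            candidates = [
                (int(row[0]), [parse_feature(tok) for tok in row[1:]])
                for row in block[1:]
            ]
            if sum(1 for count, _ in candidates if count > 0) != 1:
                raise ValueError(f"{name}: exactly one observed event is required")
            events.append((name, candidates))
            block = []
    return events


sample = """\
event_1
1 f1 f2:2
0 f3

# a comment-only line also acts as a blank line
event_2
0 f1
1 f2 f3
"""

for name, candidates in parse_amis_events(sample):
    print(name, candidates)
```

The check on the observation counts mirrors the rule that each event has exactly one observed event and that complement events have a count of zero.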
First, you must prepare a file specified as FIXMAP_FILE, which is used to map properties specified in event files to feature names. FIXMAP_FILE looks as follows.
[label name] [label name] [label name] ...
[property name] [label name] [feature name] [label name] [feature name] ...
[property name] [label name] [feature name] [label name] [feature name] ...
[property name] [label name] [feature name] [label name] [feature name] ...
...
The format of an event file is as follows.
[label name] 1 [property] [property] [property] ...
[label name] 1 [property] [property] [property] ...
[label name] 1 [property] [property] [property] ...
...
As [property], you can write
[property name]
or
[property name]:[property value]
if FEATURE_TYPE is integer or real.
The first line of FIXMAP file enumerates the set of labels used for classification. Each of the other lines is a definition of a property, and specifies the name of the feature that will become active when the property and the label co-occur.
Each line of an event file corresponds to an observed event. The line starts with the name of an observed label and the frequency of observation. The rest of the line enumerates the observed properties. Amis generates an observed event automatically, by mapping the pairs of the observed label and enumerated properties to the features specified in FIXMAP_FILE. The complement events are generated the same way, using the labels other than the observed label.
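The mapping from an AmisFix event line to an observed event and its complement events can be sketched in Python as follows. `parse_fixmap` and `expand_line` are hypothetical helpers, not part of Amis. Two points are assumptions rather than documented behavior: a property with no feature defined for a label is skipped for that label, and a property value is carried over unchanged as the feature value.

```python
def parse_fixmap(text):
    """Return (labels, mapping) where mapping[(property, label)] is a feature name."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    labels = lines[0]
    mapping = {}
    for prop, *pairs in lines[1:]:
        # pairs alternate: label, feature name, label, feature name, ...
        for label, feature in zip(pairs[0::2], pairs[1::2]):
            mapping[(prop, label)] = feature
    return labels, mapping


def expand_line(line, labels, mapping):
    """Expand one AmisFix event line into (count, features) pairs, one per label."""
    observed, freq, *props = line.split()
    candidates = []
    for label in labels:
        count = int(freq) if label == observed else 0
        features = []
        for prop in props:
            name, sep, value = prop.partition(":")
            if (name, label) in mapping:  # assumption: undefined pairs are skipped
                features.append(mapping[(name, label)] + sep + value)
        candidates.append((count, features))
    return candidates


fixmap = """\
yes no
rainy yes f_rainy_yes no f_rainy_no
windy yes f_windy_yes
"""

labels, mapping = parse_fixmap(fixmap)
print(expand_line("no 1 rainy windy:2", labels, mapping))
```

Each returned pair corresponds to one line of an Amis-format event: the observed label keeps its frequency, and every other label becomes a complement event with a count of zero.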
[event name]
[frequency] [feature] [feature] [feature] ...
[disjunctive node]
BNF-like representation of [disjunctive node] is as follows:
[disjunctive node] := [reference to some node name] | { [node name] [conjunctive node] [conjunctive node] ... }
[conjunctive node] := [reference to some node name] | ( [node name] [feature] [feature] ... [disjunctive node] [disjunctive node] ... )
You must put spaces before and after curly braces and round brackets, so that they are not treated as part of node names or feature names.
A real example looks like this:
event_1
2 feature1:2 feature2:3 feature3
{ dnode_1 ( node_1 feature1:2 { dnode_2 ( node_2 feature2:3 ) ( node_3 ) } { dnode_3 $node_2 ( node_4 feature3 ) } ) }

event_2
1 feature2:3
{ dnode_1 ( node_1 feature1 ) ( node_2 { dnode_2 ( node_3 feature2:3 ) ( node_4 feature3 ) } ) }

...
As in the Amis format, a blank line separates each event description. In the AmisTree format, an event is represented with three lines. The first line specifies the name of the event. The second line gives the number of times the event was observed, followed by the activated features of the observed event. In the above example, event_1 is observed twice, and event_2 once. As in the Amis format, a feature is specified by its name, optionally together with its value. The third line represents the observed event and the complement events as a feature forest. Disjunctive nodes are represented with curly braces. Between the curly braces, the name of the node is specified first, and conjunctive nodes follow. Node names must be unique within each event, but you can use '_' for nodes that are never referred to. Conjunctive nodes are represented with round brackets. Between the round brackets, the name of the node is specified first, and activated features follow. Feature descriptions are the same as in the Amis format. You can also specify disjunctive nodes as daughter nodes. Node names are used to represent structure sharing: a node that has already appeared can be referred to by "$" followed by its node name. In event_1 in the above example, $node_2 shares the node node_2, which has already appeared. Packing the feature forest into a smaller structure through node sharing reduces the computational complexity and speeds up the computation. You can use any characters except the special characters in node names and feature names.
From a feature forest, we can extract a set of feature lists by the following mutually recursive procedures.
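One way to read such procedures is the Python sketch below. It is an illustration, not the code used by Amis, and it assumes the usual interpretation of feature forests: a disjunctive node chooses exactly one of its conjunctive daughters, and a conjunctive node contributes its own features together with one choice from each of its disjunctive daughters. `parse_forest`, `unpack_disjunctive`, and `unpack_conjunctive` are hypothetical names.

```python
from itertools import product


def parse_forest(tokens):
    """Parse the tokens of a forest line into nested lists of children."""
    nodes = {}  # node name -> parsed node, used to resolve "$name" references
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok.startswith("$"):  # reference to a node that already appeared
            return nodes[tok[1:]]
        close = "}" if tok == "{" else ")"
        name = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != close:
            if tokens[pos] in ("{", "(") or tokens[pos].startswith("$"):
                children.append(node())
            else:  # a feature such as "f" or "f:2"
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing bracket
        nodes[name] = children
        return children

    return node()


def unpack_disjunctive(children):
    """Union of the feature lists of all conjunctive daughters."""
    return [flist for child in children for flist in unpack_conjunctive(child)]


def unpack_conjunctive(children):
    """Own features combined with one feature list from every disjunctive daughter."""
    own = [c for c in children if isinstance(c, str)]
    daughters = [unpack_disjunctive(c) for c in children if not isinstance(c, str)]
    return [own + [f for part in combo for f in part] for combo in product(*daughters)]


line = ("{ dnode_1 ( node_1 feature1:2 { dnode_2 ( node_2 feature2:3 ) ( node_3 ) } "
        "{ dnode_3 $node_2 ( node_4 feature3 ) } ) }")
feature_lists = unpack_disjunctive(parse_forest(line.split()))
for flist in feature_lists:
    print(flist)
```

For event_1 in the example above, this yields four feature lists, one of which (feature1:2, feature2:3, feature3) coincides with the features given for the observed event on the second line. Full unpacking is exponential in the worst case; it is shown only to make the meaning of the forest explicit.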
Amis outputs pairs of a feature and its parameter. The output format is the same as that of a model file, so the output file can be reused as the input of a new computation. That is, parameter estimation can be continued from already estimated parameters.
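The parameters a_i corresponding to the feature functions are output. The probability of an unknown event is proportional to the product of a_i over all its activated features (for a feature with a value other than 1, a_i raised to that value); normalizing this product over the alternative events gives the probability.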
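The computation can be sketched in Python as follows. `event_probabilities` is a hypothetical helper, not part of Amis; it assumes that the parameters are alpha values, that each candidate event is given as a list of (feature name, feature value) pairs, and that no reference distribution is used.

```python
def event_probabilities(alphas, candidates):
    """Return the normalized probability of each candidate event."""
    scores = []
    for features in candidates:
        score = 1.0
        for name, value in features:
            score *= alphas[name] ** value  # alpha_i raised to the feature value
        scores.append(score)
    total = sum(scores)
    return [score / total for score in scores]


alphas = {"f1": 2.0, "f2": 0.5}  # illustrative values, not estimated parameters
candidates = [
    [("f1", 1)],
    [("f2", 1)],
    [("f1", 1), ("f2", 1)],
]
probs = event_probabilities(alphas, candidates)
print(probs)
```

With lambda parameters, the same result is obtained by summing lambda_i times the feature value and exponentiating the sum before normalizing.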