Amis short tutorial

June 8th, 2006
Kazuhiro Yoshida
Department of Computer Science, University of Tokyo
kyoshida AT is.s.u-tokyo.ac.jp

Maximum entropy models

A maximum entropy model gives a probability p(x) of an event by representing the event x as a bundle of feature functions (or features, for short) f_i. Each f_i is a function that takes an event and returns a real value, and each i corresponds to one characteristic. Most commonly, f_i(x) gives how many times the characteristic corresponding to f_i is observed in the event x.

Given an event x and feature functions f_i, a maximum entropy model gives a probability by the following formula.

p(x) = 1/Z exp( sum_i( l_i * f_i(x) ) ) = 1/Z prod_i( a_i^f_i(x) )
l_i (lambda), or equivalently a_i (alpha) = exp(l_i), is a model parameter, and intuitively it represents the weight of the feature f_i. Z is a normalizing factor that makes the probabilities p(x) sum to 1. In other words, a maximum entropy model gives each choice in an event a probability proportional to the product of the a_i corresponding to the features activated in that choice.
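As a concrete toy illustration of this formula, the following short Python sketch computes the probability of each choice of an event from given a_i weights. The feature names, counts and weight values here are made up for illustration; they are not Amis output.

# Toy illustration of p(x) = 1/Z * prod_i( a_i^f_i(x) ).
# The weights and feature counts are made-up numbers, not Amis output.
weights = {"f_a": 2.0, "f_b": 0.5, "f_c": 1.5}      # a_i for each feature

# Each choice of the event is a map from feature names to counts f_i(x).
choices = {
    "x1": {"f_a": 1, "f_b": 1},
    "x2": {"f_b": 2, "f_c": 1},
}

def unnormalized(feats):
    score = 1.0
    for name, count in feats.items():
        score *= weights[name] ** count
    return score

scores = {x: unnormalized(feats) for x, feats in choices.items()}
Z = sum(scores.values())                             # normalizing factor
print({x: s / Z for x, s in scores.items()})         # probabilities summing to 1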

The maximum entropy method estimates the parameters of the above model so as to maximize the likelihood of the training data. Smoothing of the estimation by maximum a posteriori (MAP) estimation is also widely used to avoid over-training. The estimation algorithms of Amis whose names contain "MAP" support Gaussian priors, and those whose names contain "BC" support exponential priors.
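For reference, with a Gaussian prior the estimation roughly maximizes a penalized log-likelihood of the form

sum_x( log p(x) ) - sum_i( l_i^2 / (2 * sigma_i^2) )

where the variances sigma_i are hyperparameters of the prior. This is the standard MAP formulation; the exact parameterization used by Amis should be checked in its manual.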

The most typical procedure for using Amis to solve real problems is as follows: express the training data as events and features, prepare the model and event files (and a fixmap file when the AmisFix format is used), run amis to estimate the feature weights, and use the estimated weights to compute probabilities for new inputs. The following sections illustrate this procedure for each of the three event-file formats.

Using AmisFix format

The AmisFix format is used for problems that can be treated as the task of assigning a label to each classification target, such as document classification. Let us consider the problem of predicting the part-of-speech (POS) of a word. In this case, we can regard each word in documents as an event, and the word and its correct POS can be given to amis as training data. (Let us assume there are only three POSs, N[oun], V[erb] and A[djective], to simplify the problem.) Let us use the following as training data.
events -> N
parameter -> N
goes -> V
killed -> V
green -> A
faster -> A
The first line means that "we observed the word events and its POS was N." Then, if we are to use "the surface string of the word", "significant suffixes" and "whether the word is shorter than 6 letters" as features, we prepare example.fixmap, which looks as follows.
N V A        # Target labels
events     N f_events_N     V f_events_V     A f_events_A
parameter  N f_parameter_N  V f_parameter_V  A f_parameter_A
goes       N f_goes_N       V f_goes_V       A f_goes_A
killed     N f_killed_N     V f_killed_V     A f_killed_A
green      N f_green_N      V f_green_V      A f_green_A      
faster     N f_faster_N     V f_faster_V     A f_faster_A
-s         N f_-s_N         V f_-s_V         A f_-s_A
-ed        N f_-ed_N        V f_-ed_V        A f_-ed_A
-er        N f_-er_N        V f_-er_V        A f_-er_A
longer-5   N f_longer-5_N   V f_longer-5_V   A f_longer-5_A
shorter-6  N f_shorter-6_N  V f_shorter-6_V  A f_shorter-6_A
For instance, the feature f_goes_N takes the value 1 if the word "goes" appears as N and 0 otherwise. (In this example, example.fixmap does not carry significant information, because the feature names are straightforward combinations of label names and property names. We will see an example of a more informative fixmap later.) In example.model, we list the feature names that appear in the fixmap (the names beginning with 'f' in this example), each followed by its initial value. Theoretically, the initial values do not affect the optimal values, and we usually set them to 1.0.
f_events_N     1.0
f_events_V     1.0
f_events_A     1.0
f_parameter_N  1.0
f_parameter_V  1.0
f_parameter_A  1.0
f_goes_N       1.0
f_goes_V       1.0
f_goes_A       1.0
...
An event file is as follows.
N 1 events -s longer-5   # A word which has properties "events" "-s" "longer-5" happened to be N
N 1 parameter longer-5   # A word which has properties "parameter" "longer-5" happened to be N
V 1 goes -s shorter-6
V 1 killed -ed longer-5
A 1 green shorter-6
A 1 faster -er longer-5
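Amis reads the locations of these files and the estimation settings from a configuration file, here example.conf. The following is only a rough sketch of what such a configuration might contain; except for FIXMAP_FILE, which appears later in this tutorial, the property names and values are assumptions and should be checked against the Amis manual.

DATA_FORMAT           AmisFix         # assumed property names; consult the manual
MODEL_FILE            example.model
EVENT_FILE            example.event
FIXMAP_FILE           example.fixmap
OUTPUT_FILE           example.output
ESTIMATION_ALGORITHM  GIS             # a "MAP"/"BC" variant enables smoothing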
Now we can run amis
% amis example.conf
and get the following example.output.
f_events_N	4.179041e+00
f_events_V	3.470270e-01
...
f_-s_A	2.631215e-01
f_-ed_N	3.289747e-01
f_-ed_V	5.920547e+00
f_-ed_A	5.134234e-01
...
f_longer-5_A	6.951490e-01
f_shorter-6_N	2.592765e-01
f_shorter-6_V	1.482556e+00
f_shorter-6_A	2.601511e+00
To predict the POS of the word "died", which has the properties "-ed" and "shorter-6", we get the weight of each choice by consulting example.fixmap and example.output: weight(N) = f_-ed_N * f_shorter-6_N = 0.329 * 0.259 = 0.085, weight(V) = 5.92 * 1.48 = 8.78, and weight(A) = 0.513 * 2.60 = 1.34. By normalizing these, we obtain p(N|died)=0.0084, p(V|died)=0.86 and p(A|died)=0.13, and the POS of "died" is predicted to be V. (If we just want to predict the POS and the probability values are of no concern, the normalization is unnecessary.)
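The following Python fragment (a standalone sketch, not part of Amis) reproduces this calculation from the weight values shown in example.output above.

# Weights of the properties of "died", copied from example.output above.
weights = {
    "N": {"-ed": 3.289747e-01, "shorter-6": 2.592765e-01},
    "V": {"-ed": 5.920547e+00, "shorter-6": 1.482556e+00},
    "A": {"-ed": 5.134234e-01, "shorter-6": 2.601511e+00},
}

# Unnormalized score of each POS: the product of the weights of its active features.
scores = {}
for pos, ws in weights.items():
    score = 1.0
    for w in ws.values():
        score *= w
    scores[pos] = score

Z = sum(scores.values())
for pos, score in sorted(scores.items()):
    print(pos, score / Z)      # p(A|died)=0.13, p(N|died)=0.0084, p(V|died)=0.86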

In the above example, different labels (i.e. POSs) are always accompanied by different features, but we sometimes want several labels to share the same feature. For example, the suffix "-s" is useful for eliminating the choice of "A", but it should be less important for deciding between N and V. To capture this situation, we can add a property "-s-NV" to our FIXMAP_FILE as follows:

-s-NV      N f_-s-NV_NorV   V f_-s-NV_NorV   A f_-s-NV_A
Note that the co-occurrences with N and with V are mapped to the same feature. (The event and model files should be modified correspondingly.)
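For example (keeping the original "-s" property as well, which is a design choice rather than something prescribed by the format), the event line for "goes" would become

V 1 goes -s -s-NV shorter-6

and example.model would gain the entries f_-s-NV_NorV and f_-s-NV_A, each with initial value 1.0.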

Using Amis format

Some problems do not take the form of selecting one from a fixed number of choices. For example, let us consider the problem of selecting the main verb of a given sentence. This problem can be solved by developing a binary classifier that determines whether a target word is the main verb or not, but it can more naturally be solved by a classifier that selects the main verb from among the words of the sentence. However, such classifiers do not fit AmisFix event files, because the number of choices depends on the sentence length. We can use the Amis format in such a situation. (AmisFix format event files always have equivalent Amis format event files.) Let the following be the training data (the words in [] are the main verbs of the sentences):
I [killed] a boy
He [died]
I [promised] to kill him
His brother [killed] me
If the POS of each word is also given as a part of the training data, we can use "the POS of the word", "the POS of the previous word" and "whether a verb appeared before the word" as features, and create the following model and event files.
POS-DT         1.0  # determiners
POS-N          1.0  # nouns
POS-PN         1.0  # pronouns
POS-P          1.0  # prepositions
POS-V          1.0  # verbs
Prev-DT        1.0
Prev-N         1.0
Prev-None      1.0  # First word and no previous word
Prev-PN        1.0
Prev-P         1.0
Prev-V         1.0
Prev_Verb-NO   1.0  # Whether a verb appeared before
Prev_Verb-YES  1.0
event1
0 POS-PN Prev-None Prev_Verb-NO  # I
1 POS-V Prev-PN Prev_Verb-NO     # killed
0 POS-DT Prev-V Prev_Verb-YES    # a
0 POS-N Prev-DT Prev_Verb-YES    # boy

event2
0 POS-PN Prev-None Prev_Verb-NO  # He
1 POS-V Prev-PN Prev_Verb-NO     # died

event3
0 POS-PN Prev-None Prev_Verb-NO  # I
1 POS-V Prev-PN Prev_Verb-NO     # promised
0 POS-P Prev-V Prev_Verb-YES     # to
0 POS-V Prev-P Prev_Verb-YES     # kill
0 POS-PN Prev-V Prev_Verb-YES    # him

event4
0 POS-PN Prev-None Prev_Verb-NO  # His
0 POS-N Prev-PN Prev_Verb-NO     # brother
1 POS-V Prev-N Prev_Verb-NO      # killed
0 POS-PN Prev-V Prev_Verb-YES    # me
We can run amis by
% amis example.conf
and obtain example.output. The calculation of probabilities is similar to the AmisFix case: the lines of an event are the choices, and the probability of each choice is proportional to the product of the weights of its features.
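For instance, to pick the main verb of a new sentence, we can score each word by the product of the weights of its active features and take the word with the highest score. The following Python sketch illustrates this; the sentence and the weight values are made up for illustration, not taken from an actual example.output.

# Score each word of a sentence and pick the most likely main verb.
# The weights below are placeholders; in practice they are read from example.output.
weights = {
    "POS-DT": 0.3, "POS-N": 0.8, "POS-PN": 0.4, "POS-P": 0.5, "POS-V": 3.0,
    "Prev-DT": 0.9, "Prev-N": 1.2, "Prev-None": 0.7, "Prev-PN": 1.5,
    "Prev-P": 0.5, "Prev-V": 0.6,
    "Prev_Verb-NO": 2.0, "Prev_Verb-YES": 0.4,
}

# Each word is given as the list of its active features.
sentence = [
    ("She",    ["POS-PN", "Prev-None", "Prev_Verb-NO"]),
    ("wanted", ["POS-V",  "Prev-PN",   "Prev_Verb-NO"]),
    ("to",     ["POS-P",  "Prev-V",    "Prev_Verb-YES"]),
    ("leave",  ["POS-V",  "Prev-P",    "Prev_Verb-YES"]),
]

def score(features):
    s = 1.0
    for f in features:
        s *= weights[f]
    return s

scores = [(word, score(feats)) for word, feats in sentence]
Z = sum(s for _, s in scores)
for word, s in scores:
    print(word, s / Z)                    # probability of being the main verb
print("main verb:", max(scores, key=lambda ws: ws[1])[0])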

Using AmisTree format

Let us solve the problem of disambiguating CFG parsing results. We can use an Amis format event file for training, because disambiguation of parse trees is the task of selecting the correct parse tree from among those licensed by the grammar.

For example, suppose that the given CFG provides four parses for the sentence "time flies like an arrow", and that the correct one is the reading in which "time" is a noun and "flies" is the main verb.

Let us associate a feature named "f-A-B_C" with each subtree of the shape "(A (B ...) (C ...))". (This is similar to the scoring scheme of a PCFG, where such a subtree is given the score P(B C|A).) In addition, a feature "f-A" that represents the parent symbol alone is also used, expecting better generalization. The above event can be written in Amis format as follows:

event1
0 f-S f-S-NP_VP f-NP f-NP-N_N f-N f-N_time f-N f-N_flies f-VP f-VP-V_NP f-V f-V_like f-NP f-NP-D_N f-D f-D_an f-N f-N_arrow 
0 f-S f-S-S_PP f-S f-S-V_N f-V f-V_time f-N f-N_flies f-PP f-PP-P_NP f-P f-P_like f-NP f-NP-D_N f-D f-D_an f-N f-N_arrow 
0 f-S f-S-V_NP f-V f-V_time f-NP f-NP-N_PP f-N f-N_flies f-PP f-PP-P_NP f-P f-P_like f-NP f-NP-D_N f-D f-D_an f-N f-N_arrow 
1 f-S f-S-N_VP f-N f-N_time f-VP f-VP-V_PP f-V f-V_flies f-PP f-PP-P_NP f-P f-P_like f-NP f-NP-D_N f-D f-D_an f-N f-N_arrow

This case is not problematic because there are only four possible parse trees. However, if we are to develop a practical CFG parser, the ambiguity of the grammar will be higher, and the training data will contain longer sentences with a large number of possible analyses. In such a situation, the Amis format has difficulties because the number of observed and complement events becomes intractably large, resulting in an enormous event file and computational cost.

However, a closer look at the above event reveals its redundancy: for example, the set of features "f-NP f-NP-D_N f-D f-D_an f-N f-N_arrow" appears in more than one analysis. When observed and complement events share features like this, the AmisTree format can reduce the size of event files and the cost of computation by orders of magnitude. In this case, the above Amis format event is equivalent to the following AmisTree format event.

event1
1 f-S f-S-N_VP f-N f-N_time f-VP f-VP-V_PP f-V f-V_flies f-PP f-PP-P_NP f-P f-P_like f-NP f-NP-D_N f-D f-D_an f-N f-N_arrow 
{ _ ( _ f-S { _ ( _ f-S-NP_VP { _ ( _ f-NP { _ ( _ f-NP-N_N { d_4 ( _ f-N { _ ( _ f-N_time ) } ) } { d_6 ( _ f-N { _ ( _ f-N_flies ) } ) } ) } ) } { _ ( _ f-VP { _ ( _ f-VP-V_NP { _ ( _ f-V { _ ( _ f-V_like ) } ) } { d_12 ( _ f-NP { _ ( _ f-NP-D_N { _ ( _ f-D { _ ( _ f-D_an ) } ) } { _ ( _ f-N { _ ( _ f-N_arrow ) } ) } ) } ) } ) } ) } ) ( _ f-S-S_PP { _ ( _ f-S { _ ( _ f-S-V_N { d_20 ( _ f-V { _ ( _ f-V_time ) } ) } $d_6 ) } ) } { d_24 ( _ f-PP { _ ( _ f-PP-P_NP { _ ( _ f-P { _ ( _ f-P_like ) } ) } $d_12 ) } ) } ) ( _ f-S-V_NP $d_20 { _ ( _ f-NP { _ ( _ f-NP-N_PP $d_6 $d_24 ) } ) } ) ( _ f-S-N_VP $d_4 { _ ( _ f-VP { _ ( _ f-VP-V_PP { _ ( _ f-V { _ ( _ f-V_flies ) } ) } $d_24 ) } ) } ) } ) }
The above feature forest can be obtained by directly converting the chart produced by dynamic programming parsing algorithms such as the CKY algorithm.
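The efficiency of the AmisTree format comes from the fact that the total weight of all trees in the forest can be computed bottom-up by dynamic programming, reusing the score of each shared subtree instead of enumerating the trees. The following Python sketch illustrates this idea on a tiny forest; it is an illustration of the principle, not Amis's actual implementation, and the node structure and weights are made up.

# Each disjunctive node is a list of conjunctive alternatives; each conjunctive
# node is a pair (list of feature names, list of disjunctive daughters).
weights = {"f-NP": 1.2, "f-NP-D_N": 0.9, "f-D": 1.0, "f-N": 1.1}   # made-up a_i

leaf_d  = [(["f-D"], [])]
leaf_n  = [(["f-N"], [])]
np_node = [(["f-NP", "f-NP-D_N"], [leaf_d, leaf_n])]               # a shared subtree

_cache = {}

def disj_score(node):
    # Sum over the conjunctive alternatives; shared subtrees are computed only once.
    key = id(node)
    if key not in _cache:
        _cache[key] = sum(conj_score(c) for c in node)
    return _cache[key]

def conj_score(conj):
    # Product of the node's own feature weights and the scores of its daughters.
    features, daughters = conj
    s = 1.0
    for f in features:
        s *= weights[f]
    for d in daughters:
        s *= disj_score(d)
    return s

print(disj_score(np_node))     # total unnormalized weight of all trees in the forest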

Nonetheless, this approach has some shortcomings. For example, if we want to use features that cover wider portions of parse trees, like the features used in headed PCFGs, the subtrees shared among the analyses become smaller and the feature forest becomes larger. Generally speaking, in AmisTree events there is a trade-off between the size of an event and the locality of the features that can be used.