
Stochastic Attribute-Value Grammars

by Steven Abney
Computational Linguistics (1996)

Abstract

Probabilistic analogues of regular and context-free grammars are well-known in computational linguistics, and currently the subject of intensive research. To date, however, no satisfactory probabilistic analogue of attribute-value grammars has been proposed: previous attempts have failed to define a correct parameter-estimation algorithm. In the present paper, I define stochastic attribute-value grammars and give a correct algorithm for estimating their parameters. The estimation algorithm is adapted from Della Pietra, Della Pietra, and Lafferty (1995). To estimate model parameters, it is necessary to compute the expectations of certain functions under random fields. In the application discussed by Della Pietra, Della Pietra, and Lafferty (representing English orthographic constraints), Gibbs sampling can be used to estimate the needed expectations. The fact that attribute-value grammars generate constrained languages makes Gibbs sampling inapplicable, but I show how a variant of Gibbs sampling, the Metropolis-Hastings algorithm, can be used instead.


arXiv:cmp-lg/9610003v1 23 Oct 1996
Steven Abney
University of Tübingen
http://www.sfs.nphil.uni-tuebingen.de/~abney/
abney@sfs.nphil.uni-tuebingen.de
Wilhelmstr. 113, 72074 Tübingen, Germany
1 Introduction
Stochastic versions of regular grammars and context-free grammars have re-
ceived a great deal of attention in computational linguistics for the last sev-
eral years, and basic techniques of stochastic parsing and parameter estimation
have been known for decades. However, regular and context-free grammars are
widely deemed linguistically inadequate; standard grammars in computational
linguistics are attribute-value grammars of some variety. Before the advent of
statistical methods, regular and context-free grammars were considered too in-
expressive for serious consideration, and even now the reliance on stochastic
versions of the less-expressive grammars is often seen as an expedient necessi-
tated by the lack of an adequate stochastic version of attribute-value grammars.
Attempts have been made to extend stochastic models developed for the
regular and context-free cases to attribute-value grammars, but to date
without success (see footnote 1). Brew [1] sketches a probabilistic version of HPSG, but admits
that his way of dealing with re-entrancies in feature structures is problematic.
Eisele [2] attempts to translate stochastic context-free techniques to constraint-
based grammar by assigning probabilities to SLD proof trees. Both Brew and
Eisele propose associating weights with grammar-rule analogues (typed feature
structures in Brew’s case; Horn clauses in Eisele’s case) and setting weights pro-
portional to expected rule frequencies. For want of a standard term, I will call
this the Expected Rule Frequency (ERF) method. Both propose using iterative
reestimation of rule-frequency expectations when dealing with incomplete data
(unannotated corpora), along the lines of the EM algorithm.
The attempt is ultimately unsuccessful. The ERF method is provably cor-
rect for the context-free case, but it fails in the presence of context dependencies,
as will be discussed below. Both Brew and Eisele recognize that the
ERF method has deficiencies. Eisele in particular identifies an important
symptom that indicates that something has gone amiss: the grammar induced by
the EM algorithm defines a probability distribution over trees that is not in
accordance with their frequency in the training corpus. Moreover, Eisele recog-
nizes that this problem arises only where there are context dependencies. That
such dependencies lead to problems is not surprising, given the independence
assumptions underlying Eisele’s model, but he is not able to explain why they
manifest themselves in the way they do, nor what can be done to address the
problem.
Now in fact solutions to the context-sensitivity problem have long been
known, and are the subject of continuing study, in the image processing field
and in related areas of statistics. The models of interest are known as ran-
dom fields. Random fields can be seen as a generalization of Markov chains and
stochastic branching processes. Markov chains can be seen as stochastic versions
of regular grammars (Hidden Markov Models are in turn stochastic functions
of Markov chains) and random branching processes are stochastic versions of
context-free grammars. The evolution of a Markov chain describes a line, in
which each stochastic choice depends only on the state at the immediately pre-
ceding time-point. The evolution of a random branching process describes a
tree in which a finite-state process may spawn multiple child processes at the
next time-step, but the number of processes and their states depend only on
the state of the unique parent process at the preceding time-step. In particular,
stochastic choices are independent of other choices at the same time-step: each
process evolves independently. If we permit re-entrancies, that is, if we permit
processes to re-merge, we generally introduce context-sensitivity. In order to
re-merge, processes must generally be “in synch,” which is to say, they cannot
evolve in complete independence of one another. Random fields are a particular
class of multi-dimensional random processes, that is, processes corresponding
to probability distributions over an arbitrary graph. They were originally
studied by Gibbs, nearly a hundred years ago, as a model for statistical mechanics,
and the general family of probability distributions involved is still known by his
name.

Footnote 1: I confine my discussion here to Brew and Eisele because they aim to describe parametric
models of probability distributions over the languages of constraint-based grammars, and to
estimate the parameters of those models. Other authors have assigned weights or preferences
to constraint-based grammars but not discussed parameter estimation. One approach of the
latter sort that I find of particular interest is that of Stefan Riezler [6], who describes a weighted
logic for constraint-based grammars that characterizes the languages of the grammars as fuzzy
sets. This interpretation avoids the need for normalization that Brew and Eisele face, though
parameter estimation still remains to be addressed.

To my knowledge, the first application of random fields to natural language
was by Mark et al. [3]. The problem of interest was how to combine a stochastic
context-free grammar with n-gram language models. The resulting structures,
e.g., (1), obviously involve re-entrancies and context-sensitivity.
(1) [Figure: tree for the sentence “there was no response”: [S [NP there] [VP was [NP no response]]]]
It was clear at that time that a similar approach ought to succeed for general
attribute-value grammars, but the issue was not pursued.
Recent work by Della Pietra, Della Pietra, and Lafferty [5] (henceforth,
DDL) also applies random fields to natural language processing. The application
they consider is the induction of English orthographic constraints—inducing a
grammar of possible English words. The authors describe an algorithm for
selecting informative properties of words to construct a random field, and for
setting the parameters of the field optimally for a given set of properties, to
model an empirical word distribution.
The DDL algorithms require the computation of the expectations, under
random fields, of certain characteristic functions. In general, computing these
expectations involves summing over all configurations (all possible character
sequences, in the orthography application), which is not possible when the con-
figuration space is large. Instead, DDL use Gibbs sampling to estimate the
needed expectations.
The orthography application cannot be immediately converted into a means
of equipping attribute-value grammars with probabilities. Any labelling of a
finite linear graph (see footnote 2) with ASCII characters yields a possible (though not
necessarily probable) English word, and this unconstrainedness is essential for the use
of Gibbs sampling. By contrast, the set of dags admitted by an attribute-value
grammar G is highly constrained: most of the time, relabelling a dag admitted
by G does not yield a new dag admitted by G. Gibbs sampling is not applicable.
However, I will show that a variant of Gibbs sampling, the Metropolis-Hastings
algorithm, can be used. Indeed, we can use a random branching process much
like Brew’s or Eisele’s to supply the so-called proposal matrix for the
Metropolis-Hastings algorithm.

Footnote 2: To be precise, DDL use closed linear graphs, i.e., polygons.

In this way, we can assign probabilities to the classes of dags admitted by
attribute-value grammars. We can use these probabilities to disambiguate
sentences (by selecting the most-probable parse), and we can give a
parameter-estimation algorithm that is correct, in the sense that, if we generate a training
corpus of size n from a model M, and then estimate parameters from the
training corpus to yield a model-estimate M̂_n, then M̂_n converges to M as n → ∞.
Acknowledgements
This work has greatly profited from the comments, criticism, and suggestions
of a number of people, including John Lafferty, Stanley Peters, Hans Uszkoreit,
and other members of the audience for talks I gave at Saarbrücken and
Tübingen. Michael Miller and Kevin Mark introduced me to random fields as a
way of dealing with context-sensitivities in language, and I have been fascinated
ever since. I would especially like to thank Mark Light and Stefan Riezler for
extended discussions of the issues addressed here and helpful criticism on points
of presentation. All responsibility for flaws and errors of course remains with
me.
2 Stochastic Context-Free Grammars
Let us begin by examining stochastic context-free grammars and asking why
the “obvious” generalization to attribute-value grammars fails. A point of ter-
minology: I will use the term grammar to refer to an unweighted grammar, be
it a context-free grammar or attribute-value grammar. The combination of a
grammar and weights (later, also properties) I will refer to as a model. (Occa-
sionally I will use model to refer to the weights themselves, or the probability
distribution they define.)
Throughout we will use the following stochastic context-free grammar for
illustrative purposes. Let us call the underlying grammar G1 and the grammar
equipped with weights as shown, M1:
(2) 1. S → A A    β1 = 1/2
    2. S → B      β2 = 1/2
    3. A → a      β3 = 2/3
    4. A → b      β4 = 1/3
    5. B → a a    β5 = 1/2
    6. B → b b    β6 = 1/2
The probability of a given tree is computed as the product of probabilities of
rules used in it. For example:
(3) [S [A a] [A a]], with rule 1 (weight β1) applied at S and rule 3 (weight β3) applied at each A

Let x be tree (3) and let q1 be the probability distribution over trees defined by
model M1. Then:

(4) q1(x) = β1 · β3 · β3 = 1/2 · 2/3 · 2/3 = 2/9

In parsing, we use the probability distribution q1(x) defined by model M1 to
disambiguate: the grammar assigns some set of trees {x1, . . . , xn} to a sentence
σ, and we choose that tree xi that has greatest probability q1(xi). For example,
G1 assigns two parses to the sentence aa: tree (3) above and tree (5):
(5) [S [B a a]]
The probability of tree (3) is 2/9, as we have seen. The probability of tree (5)
is β2β5 = 1/2 · 1/2 = 1/4. Since 1/4 > 2/9, a stochastic parser for M1 should
return tree (5) on input aa.
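
The disambiguation step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `BETA`, `tree_probability`, and `parses_of_aa` are hypothetical, and each parse is represented simply as the list of rule numbers from (2) used in its derivation.

```python
from fractions import Fraction as F

# Rule weights of model M1, indexed by rule number as in (2).
BETA = {1: F(1, 2), 2: F(1, 2), 3: F(2, 3), 4: F(1, 3), 5: F(1, 2), 6: F(1, 2)}

def tree_probability(rules_used, beta=BETA):
    """Product of the weights of the rules used in a derivation."""
    prob = F(1)
    for rule in rules_used:
        prob *= beta[rule]
    return prob

# The two parses of the sentence "aa", each given by the rules it uses.
parses_of_aa = {
    "[S [A a] [A a]]": [1, 3, 3],  # tree (3)
    "[S [B a a]]": [2, 5],         # tree (5)
}

scores = {tree: tree_probability(rules) for tree, rules in parses_of_aa.items()}
best_parse = max(scores, key=scores.get)  # most probable parse
```

Exact fractions are used so that the values 2/9 and 1/4 from the text can be compared without rounding.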
The issue of efficiently computing the most-probable parse for a given sen-
tence has been thoroughly addressed in the literature. The standard parsing
techniques can be applied as is to the random-field models to be discussed be-
low, so I simply refer the reader to the literature. Instead, I concentrate on
parameter estimation, which for attribute-value grammars cannot be accom-
plished by standard techniques.
By parameter estimation we mean determining values for the weights β. In
order for a stochastic grammar to be useful, we must be able to compute the
correct weights, where by correct weights we mean the weights that best account
for a training corpus. The degree to which a given set of weights accounts for
a training corpus is measured by the similarity between the distribution q_β(x)
determined by the weights β and the distribution of trees x in the training
corpus.
2.1 The Goodness of a Model
The distribution determined by the training corpus is known as the empirical
distribution. For example, suppose we have a training corpus containing twelve
trees of the following four types from L(G1):
(6)
           x1                 x2                 x3             x4
           [S [A a] [A a]]    [S [A b] [A b]]    [S [B a a]]    [S [B b b]]
    c  =   4                  2                  3              3              (total 12)
    p̃ =   4/12               2/12               3/12           3/12

If c_i is the count of how often the i-th tree (type) appears in the corpus, then

$$\tilde{p}(x_i) = \frac{c_i}{\sum_j c_j}$$

In comparing a distribution q to the empirical distribution p̃, we shall actually
measure dissimilarity rather than similarity. Our measure for dissimilarity
of distributions is the Kullback-Leibler distance, defined as:

(7) $$D(\tilde{p} \,\|\, q) = \sum_x \tilde{p}(x) \ln \frac{\tilde{p}(x)}{q(x)}$$

The distance between p̃ and q at point x is the log of the ratio of p̃(x) to
q(x). The overall distance between p̃ and q is the average distance, where the
averaging is over tree (tokens) in the corpus; i.e., point distances ln(p̃(x)/q(x))
are weighted by p̃(x) and summed.
For example, let q1 be, as before, the distribution determined by model M1.
The following table shows q1, p̃, the ratio q1(x)/p̃(x), and the weighted point
distance p̃(x) ln(p̃(x)/q1(x)). The sum of the fourth column is the Kullback-Leibler
distance D(p̃||q1) between p̃ and q1. The third column contains q1(x)/p̃(x)
rather than p̃(x)/q1(x) so that one can see at a glance whether q1(x) is too large
(q1(x)/p̃(x) > 1) or too small (< 1).

(8)        q1      p̃      q1/p̃    p̃ ln(p̃/q1)
    x1     2/9     1/3     0.67     0.14
    x2     1/18    1/6     0.33     0.18
    x3     1/4     1/4     1.00     0.00
    x4     1/4     1/4     1.00     0.00
                           total    0.32

The total distance D(p̃||q1) = 0.32.
One set of weights is better than another if its distance from the empirical
distribution is less. For example, let us consider a different set of weights for
grammar G1. Let M′ be G1 with weights (1/2, 1/2, 1/2, 1/2, 1/2, 1/2), and let
q′ be the probability distribution determined by M′. Then the computation of
the Kullback-Leibler distance is as follows:

(9)        q′      p̃      q′/p̃    p̃ ln(p̃/q′)
    x1     1/8     1/3     0.38     0.33
    x2     1/8     1/6     0.75     0.05
    x3     1/4     1/4     1.00     0.00
    x4     1/4     1/4     1.00     0.00
                           total    0.38
The fit for x2 improves, but that is more than offset by a poorer fit for x1.
The distribution q1 is a better distribution than q′, in the sense that q1 is more
similar (less dissimilar) to the empirical distribution than q′ is.
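
The two distance calculations can be reproduced with a short Python sketch. The function name `kl_distance` and the dictionary layout are assumptions made for illustration. Only the four corpus trees are listed, since trees with empirical probability 0 contribute nothing to the sum.

```python
import math

def kl_distance(p, q):
    """D(p||q) = sum over x of p(x) ln(p(x)/q(x)); terms with p(x) = 0 are skipped."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Empirical distribution of corpus (6) and the two candidate distributions.
p_emp = {"x1": 1/3, "x2": 1/6, "x3": 1/4, "x4": 1/4}
q1 = {"x1": 2/9, "x2": 1/18, "x3": 1/4, "x4": 1/4}     # model M1
q_prime = {"x1": 1/8, "x2": 1/8, "x3": 1/4, "x4": 1/4}  # model M' (all weights 1/2)

d_q1 = kl_distance(p_emp, q1)
d_q_prime = kl_distance(p_emp, q_prime)
```

Unrounded, the two distances come out at roughly 0.318 and 0.375; the tables report two-decimal figures obtained by summing rounded entries. Either way, q1 is the closer of the two.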
This particular measure of goodness of a set of weights has a number of nice
properties. For one thing, it is not hard to show that the distribution closest to
the empirical distribution is identically the maximum likelihood distribution.
Another reason for adopting the definition of goodness in terms of Kullback-Leibler
distance is the following. Suppose Nature secretly chooses some set of
weights M for G1. These are the true weights; they define the true distribution q.
Nature then generates trees at random from M in accordance with q. Let p̃_n be
the empirical distribution determined by the first n trees that Nature generates.
A parameter-setting method must choose a model (a set of weights) M̂_n given
p̃_n, for each n. A parameter-setting method is correct if it converges to M,
the true model. The sequence of hypotheses M̂_1, M̂_2, ... defining distributions
q̂_1, q̂_2, ... is said to converge to M (defining distribution q) just in case, for
all tolerances ε, there is some point n such that D(q||q̂_n′) < ε for all n′ > n.
It can be shown that D(q||p̃_n) converges to 0; that is, lim_{n→∞} p̃_n = q. If
a parameter-setting method returns the model M̂_n that minimizes D(p̃_n||q̂_n),
then lim_{n→∞} q̂_n = lim_{n→∞} p̃_n, if the limiting distribution for p̃_n is generable
by any model with underlying grammar G1. Since q is generable by such a
grammar, and q is the limit distribution for p̃_n, it follows that q is also the limit
distribution for q̂_n, and the method is correct.
Note that the model M̂ that minimizes the distance D(q||q̂) is M itself, and
D(q||q) = 0. This does not mean, however, that D(p̃_n||q̂_n) = 0 for the model
minimizing D(p̃_n||q̂_n). The empirical distributions p̃_n converge to q, but do
not necessarily equal q. Intuitively, the relative frequency of any given tree
converges to its true probability, but need not be precisely its true probability,
even in very large corpora.
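
The convergence of the empirical distribution can be illustrated by simulation. The following is a minimal sketch, assuming Nature's model is M1, so that the true distribution q ranges over the six trees of L(G1). The seed and the sample sizes are arbitrary choices. The sketch measures D(p̃_n||q), which stays finite even when some trees have not yet been observed.

```python
import math
import random
from collections import Counter

# True distribution q over the six trees of L(G1) under model M1.
Q_TRUE = {"x1": 2/9, "x2": 1/18, "x3": 1/4, "x4": 1/4, "x5": 1/9, "x6": 1/9}

def empirical(sample):
    """Relative frequency of each tree type in the sample."""
    counts = Counter(sample)
    return {x: c / len(sample) for x, c in counts.items()}

def kl_distance(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

rng = random.Random(0)  # fixed, arbitrary seed for repeatability
trees = list(Q_TRUE)
weights = [Q_TRUE[x] for x in trees]

distances = {}
for n in (10, 1000, 100000):
    sample = rng.choices(trees, weights=weights, k=n)
    distances[n] = kl_distance(empirical(sample), Q_TRUE)
```

The distance for the largest sample should be very small but, as the text notes, it need not be exactly zero.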
2.2 The ERF Method
For stochastic context-free grammars, it can be shown that the Expected Rule
Frequency (ERF) method mentioned in the introduction always yields the best
model for a given training corpus. To define the ERF method, we require a
bit of terminology and notation. With each rule i in a stochastic context-free
grammar is associated a weight βi and a function fi(x) that returns the number
of times rule i is used in the derivation of tree x. For example, consider tree
(3), repeated here as (10):
(10) [S [A a] [A a]], with rule 1 (weight β1) applied at S and rule 3 (weight β3) applied at each A

Rule 1 is used once and rule 3 is used twice; accordingly f1(x) = 1, f3(x) = 2,
and fi(x) = 0 for i ∈ {2, 4, 5, 6}.
The expectation of a function over a probability space (for each i, fi is such
a function) is simply the average value of the function. We use the notation p[f]
to represent the expectation of f under probability distribution p. It is defined
as:

$$p[f] = \sum_x p(x) f(x)$$

The ERF method instructs us to choose the weight for rule i proportional
to the average frequency of rule i in the corpus. That is:

$$\beta_i \propto \tilde{p}[f_i]$$

Algorithmically, we compute the expectation of each rule’s frequency, and
normalize among rules with the same lefthand side. For example, consider corpus
(6). The expectation of each rule frequency fi is a sum of terms p̃(x)fi(x).
These terms are shown for each tree, in the following table. The columns f1
through f6 correspond to the rules S → A A, S → B, A → a, A → b, B → a a, and
B → b b, respectively.

                              p̃     p̃f1    p̃f2    p̃f3    p̃f4    p̃f5    p̃f6
    x1  [S [A a] [A a]]       1/3    1/3            2/3
    x2  [S [A b] [A b]]       1/6    1/6                    2/6
    x3  [S [B a a]]           1/4            1/4                    1/4
    x4  [S [B b b]]           1/4            1/4                            1/4
    p̃[f] =                          1/2    1/2     2/3     1/3     1/4     1/4
    β =                              1/2    1/2     2/3     1/3     1/2     1/2

For example, in tree x1, rule 1 is used once and rule 3 is used twice. The
empirical probability of x1 is 1/3, so x1’s contribution to p̃[f1] is 1/3 · 1, and
its contribution to p̃[f3] is 1/3 · 2. The weight βi is obtained from p̃[fi] by
normalizing among rules with the same lefthand side. For example, the expected
rule frequencies p̃[f1] and p̃[f2] of rules with lefthand side S already sum to 1, so
they are adopted without change as β1 and β2. On the other hand, the expected
rule frequencies p̃[f5] and p̃[f6] for rules with lefthand side B sum to 1/2, not 1,
so they are doubled to yield weights β5 and β6. It should be observed that the
resulting weights are precisely the weights of model M1.
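
The ERF computation for corpus (6) can be sketched as follows. The data layout is an assumption for illustration: each tree type is stored with its corpus count and a map from rule number to the number of times that rule is used. The names `LHS`, `CORPUS`, and `erf_weights` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Lefthand side of each rule of G1, by rule number.
LHS = {1: "S", 2: "S", 3: "A", 4: "A", 5: "B", 6: "B"}

# Corpus (6): tree type -> (count, {rule: times used in the tree}).
CORPUS = {
    "x1": (4, {1: 1, 3: 2}),  # [S [A a] [A a]]
    "x2": (2, {1: 1, 4: 2}),  # [S [A b] [A b]]
    "x3": (3, {2: 1, 5: 1}),  # [S [B a a]]
    "x4": (3, {2: 1, 6: 1}),  # [S [B b b]]
}

def erf_weights(corpus, lhs):
    """Expected rule frequencies, then normalization within each lefthand side."""
    total = sum(count for count, _ in corpus.values())
    expected = defaultdict(F)  # Fraction() is 0
    for count, freqs in corpus.values():
        for rule, f in freqs.items():
            expected[rule] += F(count, total) * f
    lhs_mass = defaultdict(F)
    for rule, e in expected.items():
        lhs_mass[lhs[rule]] += e
    weights = {rule: expected[rule] / lhs_mass[lhs[rule]] for rule in lhs}
    return dict(expected), weights

expected, weights = erf_weights(CORPUS, LHS)
```

The sketch assumes every rule occurs at least once in the corpus, which holds here; a rule whose lefthand side never occurs would need separate handling.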
It can be proven that the ERF weights are the best weights for a given
grammar, in the sense that they define the distribution that is most similar
to the empirical distribution. That is, if β are the ERF weights (for a given
grammar), then D(p̃||q_β) < D(p̃||q_β′) for all sets of weights β′ ≠ β.
As noted earlier, one might expect the best weights to yield D(p̃||q) = 0, but
such is not the case. We have just seen, for example, that the best weights for
grammar G1 yield distribution q1, yet D(p̃||q1) = 0.32 > 0. A close inspection of
the distance calculation (8) reveals that q1 is sometimes less than p̃, but never
greater than p̃. Could we improve the fit by increasing q1? For that matter,
how can it be that q1 is never greater than p̃? As probability distributions, q1
and p̃ should have the same total mass, namely, 1. Where is the missing mass
for q1?
The answer is of course that q1 and p̃ are probability distributions over
L(G1), but not all of L(G1) appears in the corpus. Two trees are missing, and
they account for the missing mass. These two trees are:

(11) [S [A a] [A b]]    [S [A b] [A a]]

Each of these trees has probability 0 according to p̃ (hence they can be ignored
in the distance calculation), but probability 1/9 according to q1.
Intuitively, the problem is this. The distribution q1 assigns too little weight
to trees x1 and x2, and too much weight to the trees of (11); call them x5 and x6.
Yet exactly the same rules are used in x5 and x6 as are used in x1 and x2. Hence
there is no way to increase the weight for trees x1 and x2, improving their fit
to p̃, without simultaneously increasing the weight for x5 and x6, making their
fit to p̃ worse. The distribution q1 is the best compromise possible.
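
The accounting of the missing mass can be checked by enumerating all six trees of L(G1). This is a minimal sketch; the names `BETA`, `TREES`, and `q1` are hypothetical, and each tree is represented by the list of rules it uses.

```python
from fractions import Fraction as F

# Weights of model M1, by rule number.
BETA = {1: F(1, 2), 2: F(1, 2), 3: F(2, 3), 4: F(1, 3), 5: F(1, 2), 6: F(1, 2)}

# All of L(G1): the four corpus trees plus the two missing trees x5 and x6.
TREES = {
    "x1": [1, 3, 3],  # [S [A a] [A a]]
    "x2": [1, 4, 4],  # [S [A b] [A b]]
    "x3": [2, 5],     # [S [B a a]]
    "x4": [2, 6],     # [S [B b b]]
    "x5": [1, 3, 4],  # [S [A a] [A b]]
    "x6": [1, 4, 3],  # [S [A b] [A a]]
}

def q1(rules):
    prob = F(1)
    for rule in rules:
        prob *= BETA[rule]
    return prob

probs = {name: q1(rules) for name, rules in TREES.items()}
total_mass = sum(probs.values())
corpus_mass = sum(probs[x] for x in ("x1", "x2", "x3", "x4"))
```

The four corpus trees carry 7/9 of the mass under q1, and the two missing trees carry the remaining 2/9.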
To say it another way, our assumption that the corpus was generated by a
context-free grammar means that any context dependencies in the corpus must
be accidental, the result of sampling noise. There is indeed a dependency in
corpus (6): in the trees where there are two A’s, the A’s always rewrite the
same way. If corpus (6) was generated by a stochastic context-free grammar,
then this dependency is accidental.
This does not mean that the context-free assumption is wrong. If we generate
twelve trees at random from q1, it would not be too surprising if we got corpus
(6). More extremely, if we generate a random corpus of size 1 from q1, it is quite
impossible for the resulting empirical distribution to match the distribution q1.
But as the corpus size increases, the fit between p̃ and q1 becomes ever better.
3 Attribute-Value Grammars
But what if the dependency in corpus (6) is not accidental? What if we wish to
adopt a grammar that imposes the constraint that both A’s rewrite the same
way? We can impose such a constraint by using an attribute-value grammar.
Consider the following grammar, in which rewrite rules are now represented as
feature structures. Let us call this grammar G2:
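(12) 1. [S, 1: [A, 1: #1], 2: [A, 1: #1]]
     2. [S, 1: [B]]
     3. [A, 1: a]
     4. [A, 1: b]
     5. [B, 1: a]
     6. [B, 1: b]
(Each feature structure is written as its category followed by its numbered attributes. The tag #1 in rule 1 marks a re-entrancy: both A’s share the same value for attribute 1.)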
The language L(G2) is a set of dags, namely:
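(13) x1: S with daughters A and A, which share a single daughter a
     x2: S with daughters A and A, which share a single daughter b
     x3: S with daughter B, which has daughter a
     x4: S with daughter B, which has daughter b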
(The edges of the dags should actually be labelled with 1’s and 2’s, but I have
suppressed the edge labels for the sake of perspicuity.)
3.1 AV Grammars and The ERF Method
Now we face the question of how to attach probabilities to grammar G2. The
approach followed by Brew and Eisele is basically as follows (see footnote 3). Associate a weight
with each of the six “rules” of grammar G2. For example, let M2 be the model
consisting of G2 plus weights (β1, ..., β6) = (1/2, 1/2, 2/3, 1/3, 1/2, 1/2). The
weight assigned to a tree x is then (as before) the product of the weights of the
rules used in x. For example, the weight q̆2(x1) (see footnote 4) assigned to tree x1 of (13) is
2/9, computed as follows:

(14) x1: S with daughters A and A, which share the single daughter a; rule 1 (weight β1) is applied at S and rule 3 (weight β3) at each A

Rule 1 is used once and rule 3 is used twice; hence q̆2(x1) = β1β3β3 = 1/2 · 2/3 ·
2/3 = 2/9.
Observe that q̆2(x1) = β1 β3^2, which is to say, β1^{f1(x1)} β3^{f3(x1)}. Moreover, since
β^0 = 1, it does not hurt to include additional factors βi^{fi(x1)} for those i where
fi(x1) = 0. That is, we can define q̆_β corresponding to weights β = (β1, ..., βn)
generally as:

$$\breve{q}_\beta(x) = \prod_{i=1}^{n} \beta_i^{f_i(x)}$$

Now let us consider how to estimate weights. Brew and Eisele propose using
the ERF method, as in the context-free case. To be sure, Brew and Eisele are
more concerned about the case in which the training corpus consists of sentences
alone, rather than parses (dags), and they concentrate on the application of
the EM algorithm to estimate rule-frequency expectations in the absence of
complete information. But their basic method is the ERF method: rule weights
βi are set in accordance with the formula βi ∝ p̃[fi], under the constraint
that the weights for rules with the same lefthand side sum to 1. The EM
algorithm enters the picture only as a means of estimating p̃[fi] when it cannot
be determined by simple counting.

Footnote 3: To be precise, neither Brew nor Eisele adopts the attribute-value framework discussed here,
but the approaches they take in the related frameworks they do adopt are clearly analogous
to the one I describe here.
Footnote 4: The reason for the ‘˘’ will be made clear shortly.

To illustrate, let us assume a corpus distribution for the dags (13) analogous
to the distribution in (6):

(15)        x1     x2     x3     x4
    p̃ =    1/3    1/6    1/4    1/4

Using the ERF method, we estimate rule weights as follows:

(16)         p̃     p̃f1    p̃f2    p̃f3    p̃f4    p̃f5    p̃f6
    x1       1/3    1/3            2/3
    x2       1/6    1/6                    2/6
    x3       1/4            1/4                    1/4
    x4       1/4            1/4                            1/4
    p̃[f] =         1/2    1/2     2/3     1/3     1/4     1/4
    β =             1/2    1/2     2/3     1/3     1/2     1/2

This table is identical to the one given earlier in the context-free case. We arrive
at the same weights we considered above for the AV grammar G2, yielding the
distribution q̆2.
3.2 Why the ERF Method Fails
But at this point a problem arises: q̆2 is not a probability distribution. Unlike
in the context-free case, the four trees in (13) constitute the entirety of L(G2).
This time, there are no missing trees to account for the missing probability mass.
There is an obvious “fix” for this problem, as Brew and Eisele observe: we can
simply normalize q̆2. (This, by the way, is the reason for the ‘˘’ in ‘q̆2’: it is
meant to indicate that q̆2 is an “unnormalized” probability distribution.) That
is, for the AV-grammar case, we must define the distribution q_β corresponding
to the weights β as:

$$q_\beta(x) = \frac{1}{Z} \breve{q}_\beta(x)$$

where Z is a normalizing constant defined as:

$$Z = \sum_{y \in L(G)} \breve{q}_\beta(y)$$

In particular, for the ERF weights given in (16), we have Z = 2/9 + 1/18 +
1/4 + 1/4 = 7/9. Dividing q̆2 by 7/9 yields the ERF distribution:

(17)           x1     x2      x3      x4
    q2(x) =    2/7    1/14    9/28    9/28
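
The normalization step can be sketched in Python as follows. The names `BETA`, `RULE_COUNTS`, and `unnormalized` are hypothetical; the rule counts are those of the four dags of L(G2) as used in table (16).

```python
from fractions import Fraction as F

# ERF weights for G2, by rule number.
BETA = {1: F(1, 2), 2: F(1, 2), 3: F(2, 3), 4: F(1, 3), 5: F(1, 2), 6: F(1, 2)}

# Rule frequencies f_i(x) for the four dags of L(G2); absent rules have frequency 0.
RULE_COUNTS = {
    "x1": {1: 1, 3: 2},
    "x2": {1: 1, 4: 2},
    "x3": {2: 1, 5: 1},
    "x4": {2: 1, 6: 1},
}

def unnormalized(freqs, beta=BETA):
    """Weight of a dag: the product over rules of beta_i raised to f_i(x)."""
    weight = F(1)
    for rule, f in freqs.items():
        weight *= beta[rule] ** f
    return weight

q_breve = {x: unnormalized(freqs) for x, freqs in RULE_COUNTS.items()}
Z = sum(q_breve.values())                    # normalizing constant
q2 = {x: w / Z for x, w in q_breve.items()}  # normalized ERF distribution
```

Because the sum runs over all of L(G2), the result is a proper distribution by construction; whether it is the best one is a separate question.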
On the face of it, then, we can transplant the methods we used in the
context-free case to the AV case and the only problem that arises (q̆2 not summing to
1) has an obvious fix (normalization). However, something has actually gone
very wrong. The theorem according to which the ERF method yields the best
weights makes certain assumptions that we inadvertently violated by changing
L(G) and re-apportioning probability via normalization. In point of fact, we can
easily see that the ERF weights (16) are not the best weights for our example
grammar. Consider the alternative model M∗ given in (18), defining probability
distribution q∗:
(18)
    rule:   1                  2            3            4           5      6
    β =     (3+2√2)/(6+2√2)    3/(6+2√2)    √2/(1+√2)    1/(1+√2)    1/2    1/2

These weights are proper, in the sense that weights for rules with the same
lefthand side sum to one. The reader can verify that q̆∗ sums to Z = 3/(3+√2) and
that q∗ is:

(19)           x1     x2     x3     x4
    q∗(x) =    1/3    1/6    1/4    1/4

That is, q∗ = p̃. Comparing q2 (the ERF distribution) and q∗ to p̃, we observe
that D(p̃||q2) = 0.07 but D(p̃||q∗) = 0.
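
The comparison can be checked numerically. The sketch below assumes the weights of M∗ are (3+2√2)/(6+2√2), 3/(6+2√2), √2/(1+√2), 1/(1+√2), 1/2, and 1/2 for rules 1 through 6; the helper names are hypothetical, and floating-point arithmetic is used because of the square roots.

```python
import math

R2 = math.sqrt(2)
P_EMP = {"x1": 1/3, "x2": 1/6, "x3": 1/4, "x4": 1/4}
RULE_COUNTS = {"x1": {1: 1, 3: 2}, "x2": {1: 1, 4: 2},
               "x3": {2: 1, 5: 1}, "x4": {2: 1, 6: 1}}

BETA_ERF = {1: 1/2, 2: 1/2, 3: 2/3, 4: 1/3, 5: 1/2, 6: 1/2}
BETA_STAR = {1: (3 + 2*R2) / (6 + 2*R2), 2: 3 / (6 + 2*R2),
             3: R2 / (1 + R2), 4: 1 / (1 + R2), 5: 1/2, 6: 1/2}

def normalized(beta):
    """Return (Z, q): the normalizing constant and the normalized distribution."""
    w = {x: math.prod(beta[r] ** f for r, f in fs.items())
         for x, fs in RULE_COUNTS.items()}
    z = sum(w.values())
    return z, {x: wx / z for x, wx in w.items()}

def kl_distance(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

z_erf, q2 = normalized(BETA_ERF)
z_star, q_star = normalized(BETA_STAR)
d_erf = kl_distance(P_EMP, q2)       # about 0.067
d_star = kl_distance(P_EMP, q_star)  # zero up to rounding error
```

Both weight sets are proper in the sense of the text, yet only M∗ reproduces the empirical distribution.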
In short, in the AV case, the ERF weights do not yield the best weights.
This means that the ERF method does not converge to the correct weights as
the corpus size increases. If there are genuine dependencies in the grammar,
the ERF method converges systematically to the wrong weights. Fortunately,
there are methods that do converge to the right weights. These are methods
that have been developed for random fields.
4 Random Fields
A random field defines a probability distribution over a set of labelled graphs
Ω called configurations. In our case, the configurations are the dags generated
by the grammar, i.e., Ω = L(G) (see footnote 5). The weight assigned to a configuration is the
product of the weights assigned to configuration properties (see footnote 6). That is:

$$\breve{q}(x) = \prod_i \beta_i^{f_i(x)}$$

where βi is the weight for property i and fi(x) is the frequency of occurrence of
property i in configuration x. The probability of a configuration is proportional
to its weight, and is obtained by normalizing the weight distribution. That is:

$$q(x) = \frac{1}{Z} \breve{q}(x), \qquad Z = \sum_{y \in \Omega} \breve{q}(y)$$

Footnote 5: Those familiar with random fields will recognize that identifying configurations with the
dags of L(G) is not entirely unproblematic. For one thing, configurations are standardly taken
to be labelings over a fixed graph, not graphs with varying topologies. For another thing, the
configuration space is standardly taken to be finite, not countably infinite, as L(G) may be.
These issues will be dealt with in the course of discussion.
Footnote 6: The standard term in the random-fields literature is feature; I use the term property to
avoid confusion with feature in the sense of an attribute plus value.

If we identify properties of a configuration with the rules used in it, the
random field model is almost identical to the model we considered in the previous
section. There are two important differences. First, we no longer require weights
to sum to one for rules with the same lefthand side. Second, we no longer require
properties to be identical to the rules of the grammar. We use the grammar to
define the set of configurations Ω = L(G), but give ourselves more flexibility in
choosing the properties of dags we would like to use to define the probability
distribution over L(G).
Let us consider an example. Let us continue to assume grammar G2 generat-
ing language (13), and let us continue to assume the empirical distribution (15).
But now rather than taking rule applications—local trees—to be properties, let
us adopt the following two properties:
(20) Property 1: an A node with daughter a. Property 2: a B node.

For purpose of illustration, take property 1 to have weight β1 = √2 and property
2 to have weight β2 = 3/2. The functions f1 and f2 represent the frequencies
of properties 1 and 2, respectively:

(21)         x1         x2     x3         x4
    f1 =     2          0      0          0
    f2 =     0          0      1          1
    q̆ =     √2 · √2    1      3/2        3/2        Z = 6
    q  =     2/6        1/6    (3/2)/6    (3/2)/6
       =     1/3        1/6    1/4        1/4

In short, we are able to exactly recreate the empirical distribution using fewer
properties than before. Intuitively, we need only use as many properties as are
necessary to distinguish among trees that have different empirical probabilities.
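
The two-property field can be written out as a short Python sketch. The names `BETA`, `FREQS`, and `field_weight` are hypothetical; the frequencies are those given for properties 1 and 2 on the four dags.

```python
import math

BETA = (math.sqrt(2), 1.5)  # weights of property 1 and property 2

# (f1, f2) for each dag of L(G2).
FREQS = {"x1": (2, 0), "x2": (0, 0), "x3": (0, 1), "x4": (0, 1)}

def field_weight(freqs, beta=BETA):
    """Unnormalized weight: product over properties of beta_i raised to f_i(x)."""
    return math.prod(b ** f for b, f in zip(beta, freqs))

weights = {x: field_weight(f) for x, f in FREQS.items()}
Z = sum(weights.values())
q = {x: w / Z for x, w in weights.items()}
```

Note that the weights are not constrained to sum to one in any grouping; only the final normalization by Z makes q a distribution.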
This added flexibility is welcome, but it does make parameter estimation
more involved. Now we must not only choose values for weights, we must also
choose the properties that weights are to be associated with. We would like
to do both in a way that permits us to find the best model, in the sense of
the model that minimizes the Kullback-Leibler distance with respect to the
empirical distribution. Methods for doing both are given in a recent paper by
Della Pietra, Della Pietra, and Lafferty [5].
5 Field Induction
In outline, the DDL algorithm is as follows:
1. Start (t = 0) with the null field (no properties).
2. Property Selection. Consider every property that might be added to
the field qt and choose the best one.
3. Weight Adjustment. Readjust weights for all properties. The result is
a new field qt+1.
4. Iterate until the field cannot be improved.
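
The four steps can be illustrated with a toy sketch on the running example. It is not the DDL procedure: the candidate set is a small hand-picked one, each candidate is scored by refitting all weights, and plain gradient ascent on the log-likelihood stands in for the weight-adjustment step. All names, the step size, and the iteration count are assumptions made for illustration.

```python
import math

CONFIGS = ["x1", "x2", "x3", "x4"]
P_EMP = {"x1": 1/3, "x2": 1/6, "x3": 1/4, "x4": 1/4}
# Hypothetical candidate properties with their frequencies in each dag.
CANDIDATES = {
    "A-a": {"x1": 2, "x2": 0, "x3": 0, "x4": 0},
    "A-b": {"x1": 0, "x2": 2, "x3": 0, "x4": 0},
    "B":   {"x1": 0, "x2": 0, "x3": 1, "x4": 1},
}

def field(props, lam):
    """Normalized field; lam holds log-weights, one per property."""
    w = {x: math.exp(sum(l * CANDIDATES[p][x] for p, l in zip(props, lam)))
         for x in CONFIGS}
    z = sum(w.values())
    return {x: w[x] / z for x in CONFIGS}

def kl(p, q):
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def fit(props, steps=5000, rate=0.5):
    """Gradient ascent: the gradient for property i is p~[f_i] - q[f_i]."""
    lam = [0.0] * len(props)
    for _ in range(steps):
        q = field(props, lam)
        grads = [sum((P_EMP[x] - q[x]) * CANDIDATES[p][x] for x in CONFIGS)
                 for p in props]
        lam = [l + rate * g for l, g in zip(lam, grads)]
    return lam

def induce(tol=1e-9):
    props, lam = [], []                    # step 1: the null field
    best = kl(P_EMP, field(props, lam))
    history = [best]
    while True:
        trials = []
        for c in CANDIDATES:               # step 2: property selection
            if c not in props:
                l = fit(props + [c])       # step 3: weight adjustment
                trials.append((kl(P_EMP, field(props + [c], l)), c, l))
        if not trials:
            break
        d, c, l = min(trials)
        if best - d < tol:                 # step 4: stop when no improvement
            break
        props, lam, best = props + [c], l, d
        history.append(best)
    return props, lam, history

props, lam, history = induce()
```

On this example two properties suffice to drive the distance to zero, after which adding a third brings no further improvement and the loop stops.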
One has a great deal of flexibility in defining the space of properties. For
the sake of concreteness, let us take properties to be labelled subdags. In step
2 of the algorithm we do not consider every conceivable labelled subdag (there
are simply too many of them), but only the atomic (i.e., single-node) subdags
and those complex subdags that can be constructed by combining properties
already in the field or by combining a property in the field with some atomic
property.
In our running example, the atomic properties are:
(22) S A B a b
Properties can be combined by adding connecting arcs. For example:
(23) A + a = [A a]
     S + A = [S A]
     [S A] + A = [S A A]
5.1 The Null Field
Field induction begins with the null field. With the corpus we have been as-
suming, the null field takes the following form.
(24)
              x1          x2          x3          x4
    nodes     S A A a     S A A b     S B a       S B b
    q̆(x) =    1           1           1           1           Z = 4
    q(x) =    1/4         1/4         1/4         1/4
No dag x has any features, so $\breve{q}(x) = \prod_i \beta_i^{f_i(x)}$ is a product of zero terms, and
hence has value 1. As a result, q is the uniform distribution. The Kullback-
Leibler distance D(p̃||q) is 0.03. The aim of property selection is to choose a
property that reduces this distance as much as possible.
The astute reader will note that there is a problem with the null field if
L(G) is infinite. Namely, it is not possible to have a uniform distribution over
an infinite set. If each dag in an infinite set of dags is assigned a constant
nonzero probability ε, then the total probability is infinite, no matter how small
ε is. There are a couple of ways of dealing with the problem. The approach that
DDL adopt is to assume a consistent prior distribution p(k) over graph sizes k,
and a family of random fields qk representing the conditional probability q(x|k);
the probability of a tree is then p(k)q(x|k). All the random fields have the same
properties and weights, differing only in their normalizing constants.
I will take a slightly different approach here. Let us adopt an initial distribu-
tion like that proposed by Brew and Eisele. There is a natural correspondence
between AV grammars and CFG’s, a correspondence that we implicitly adopted
in earlier discussion. We assume that the rules of an AV grammar are typed
feature structures in which all types (of toplevel feature structures) are disjoint.
Types correspond to categories in a CFG, and the righthand side of the CF
analogue of rule r is the list of types of immediate constituents of r, viewed as
a feature structure. For example, the AV grammar G2 has corresponding CF
grammar G1.
In this framework, a model consists of: (1) An AV grammar G whose pur-
pose is to define a set of dags L(G). (2) An SCFG H derived from G, with
weights θ, defining a distribution p˘(d) over derivations d. There is a unique
derivation corresponding to each dag in L(G), but some derivations correspond
to no well-formed dag—intuitively, some derivations lead to unification failures.
Discarding the bad derivations and renormalizing yields the initial distribution
p(x) over dags L(G). (3) A set of properties f with weights β, to define the
final distribution $q(x) = \frac{1}{Z} \prod_i \beta_i^{f_i(x)} \, p(x)$.
There are a couple of possible choices of weights θ for the initial distribution.
The easiest approach would be to adopt the ERF weights. Field induction
would then be a way of adding context-sensitivities to the ERF distribution. An
alternative would be to adopt maximum-entropy weights. The intuitive reason
for adopting the uniform distribution (in the finite case) is that it distinguishes
dags in L(G) from dags not in L(G), but otherwise makes no assumptions
about the distribution. The uniform distribution maximizes entropy over a
finite set. Maximizing entropy is more generally applicable, however, and can
be applied to infinite sets as well. Maximum entropy distributions for context-
free languages are discussed in a paper by Miller and O’Sullivan [4], though a
number of technical questions arise that I do not wish to pursue here.
5.2 Property Selection
At each iteration, we select a new property f by considering all atomic properties
and all complex properties that can be constructed from properties already in
the field. Holding the weights constant for all old properties in the field, we
choose the best weight β for f (how β is chosen will be discussed shortly),
yielding a new distribution qf = qf,β. The score for property f is the reduction
it permits in D(p̃||qold), where qold is the old field. That is, the score for f is
D(p̃||qold) − D(p̃||qf). We compute the score for each candidate property and
add to the field that property with the highest score.
To illustrate, consider the two atomic properties ‘a’ and ‘B’. Given the null
field as old field, the best weight for ‘a’ is β = 7/5, and the best weight for ‘B’
is β = 1. This yields the fields qa and qB and the distances D(p̃||qa) and D(p̃||qB) as follows:
(25)
                    x1          x2          x3          x4
    nodes           S A A a     S A A b     S B a       S B b
    p̃               1/3         1/6         1/4         1/4
    q̆a              7/5         1           7/5         1           Z = 24/5
    qa              7/24        5/24        7/24        5/24
    p̃ ln(p̃/qa)      0.04        −0.04       −0.04       0.05        D = 0.01
    q̆B              1           1           1           1           Z = 4
    qB              1/4         1/4         1/4         1/4
    p̃ ln(p̃/qB)      0.10        −0.07       0           0           D = 0.03
The better property is ‘a’, and ‘a’ would be added to the field if these were the
only two choices.
Intuitively, ‘a’ is better than ‘B’ because ‘a’ permits us to distinguish the
set {x1, x3} from the set {x2, x4}; the empirical probability of the former is
1/3+1/4 = 7/12 whereas the empirical probability of the latter is 5/12. Distin-
guishing these sets permits us to model the empirical distribution better (since
the old field assigns them equal probability, counter to the empirical distribu-
tion). By contrast, the property ‘B’ distinguishes the set {x1, x2} from {x3, x4}.
The empirical probability of the former is 1/3 + 1/6 = 1/2 and the empirical
probability of the latter is also 1/2. The old field models these probabilities
exactly correctly, so making the distinction does not permit us to improve on
the old field. As a result, the best weight we can choose for ‘B’ is 1, which is
equivalent to not having the property ‘B’ at all.
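The comparison of ‘a’ and ‘B’ can be reproduced by brute-force enumeration, since the running example has only four dags. The sketch below is a minimal illustration under that assumption: it finds each candidate's best weight by bisection (an arbitrary choice of solver, not one prescribed by DDL) and scores the candidate by the reduction in Kullback-Leibler distance. All function and variable names are hypothetical.

```python
import math

p_emp = [1/3, 1/6, 1/4, 1/4]      # empirical distribution over x1..x4
q_old = [1/4] * 4                  # the null field (uniform)
# Property counts per dag for the two candidate atomic properties.
candidates = {"a": [1, 0, 1, 0], "B": [0, 0, 1, 1]}

def kl(p, q):
    """Kullback-Leibler distance D(p||q) over a finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reweight(q, f, beta):
    """Field obtained from q by adding property f with weight beta."""
    unnorm = [qi * beta ** fi for qi, fi in zip(q, f)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def expectation(q, f):
    return sum(qi * fi for qi, fi in zip(q, f))

def best_weight(q, f, target, lo=1e-6, hi=1e6, iters=200):
    """Bisection on a log scale; the expectation of f increases with beta."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if expectation(reweight(q, f, mid), f) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

weights, scores = {}, {}
for name, f in candidates.items():
    beta = best_weight(q_old, f, expectation(p_emp, f))
    weights[name] = beta
    scores[name] = kl(p_emp, q_old) - kl(p_emp, reweight(q_old, f, beta))

best = max(scores, key=scores.get)
print(weights, scores, best)
```

The weights come out close to 7/5 for ‘a’ and 1 for ‘B’, and ‘a’ receives the higher score, in agreement with (25).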
5.3 Selecting the Initial Weight
DDL show that there is a unique weight that maximizes the score for a new
property f (provided that the score for f is not constant for all weights), and
that the maximizing weight is the solution to the equation
(26) $q_{f,\beta}[f] = \tilde{p}[f]$
in the single unknown β. Intuitively, we choose the weight such that the expec-
tation of f under the resulting new field is equal to its empirical expectation.
Solving equation (26) for β is easy if L(G) is small enough to enumerate.
Then the sum over L(G) that is implicit in qf,β [f ] can be expanded out, and
solving for β is simply a matter of arithmetic. Things are a bit trickier if L(G)
is too large to enumerate. DDL show that we can solve equation (26) if we can
estimate qold[f = k] for k from 0 to the maximum possible value for f .
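To see why the quantities qold[f = k] suffice, one can expand the left-hand side of (26). The following short derivation is a sketch that assumes only the definition of the new field as the old field reweighted by β raised to the power f(x); K stands for the maximum possible value of f and is notation introduced here.

```latex
\begin{align*}
q_{f,\beta}(x) &= \frac{\beta^{f(x)}\, q_{\mathrm{old}}(x)}
                       {\sum_{x'} \beta^{f(x')}\, q_{\mathrm{old}}(x')} \\[4pt]
q_{f,\beta}[f] &= \frac{\sum_{x} f(x)\, \beta^{f(x)}\, q_{\mathrm{old}}(x)}
                       {\sum_{x} \beta^{f(x)}\, q_{\mathrm{old}}(x)}
               = \frac{\sum_{k=0}^{K} k\, \beta^{k}\, q_{\mathrm{old}}[f = k]}
                      {\sum_{k=0}^{K} \beta^{k}\, q_{\mathrm{old}}[f = k]}
\end{align*}
```

Grouping the dags by their value of f turns the sums over L(G) into sums over k, so equation (26) becomes an equation in β whose only other ingredients are the K + 1 numbers qold[f = k] and the empirical expectation of f.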
We can estimate qold[f = k] by means of random sampling. The idea is
actually rather simple: to estimate how often the property appears in “the
average dag”, we generate a representative mini-corpus from the distribution
qold and count. That is, we generate dags at random in such a way that the
relative frequency of dag x is qold(x) (in the limit), and we count how often the
property of interest appears in dags in our generated mini-corpus.
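The counting procedure can be sketched as follows for the property ‘a’ and the null field of the running example. The sampler here is simply `random.choices` over the four dags, which is a stand-in for whatever sampler is actually available for qold; the bisection solver and all names are likewise illustrative assumptions.

```python
import random
from collections import Counter

def solve_beta(hist, target, lo=1e-6, hi=1e6, iters=200):
    """Solve sum_k k*b^k*h_k / sum_k b^k*h_k = target for b by bisection."""
    def expect(b):
        num = sum(k * b ** k * h for k, h in hist.items())
        den = sum(b ** k * h for k, h in hist.items())
        return num / den
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if expect(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

rng = random.Random(0)
dags = ["x1", "x2", "x3", "x4"]
q_old = [0.25, 0.25, 0.25, 0.25]             # null field
f = {"x1": 1, "x2": 0, "x3": 1, "x4": 0}     # counts of property 'a'

# Generate a mini-corpus from q_old and count how often f takes each value.
sample = rng.choices(dags, weights=q_old, k=20000)
counts = Counter(f[x] for x in sample)
hist = {k: c / len(sample) for k, c in counts.items()}

# Empirical expectation of 'a' is 1/3 + 1/4 = 7/12.
beta = solve_beta(hist, target=7 / 12)
print(hist, beta)
```

With a mini-corpus of this size the estimated weight should land near the exact value 7/5, with some sampling noise.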
The application that DDL consider is the induction of English orthographic
constraints—inducing a field that assigns high probability to “English-sounding”
words and low probability to non-English-sounding words. For this application,
Gibbs sampling is appropriate. Gibbs sampling does not work for the application
to AV grammars, however. Fortunately, there is an alternative random sampling
method we can use: Metropolis-Hastings sampling. We will discuss the issue in
some detail shortly.
5.4 Readjusting Weights
When a new property is added to the field, the best value for its initial weight
is chosen, but the weights for the old properties are held constant. In general,
however, adding the new property may make it necessary to readjust weights
for all properties. The second half of the DDL algorithm involves finding the
best set of weights for a given set of properties.
The method is very similar to the method for selecting the initial weight for
a new property. Let (γ1, . . . , γn) be the old weights for the properties. Consider
the equation
(27) $q_\gamma[\beta_i^{f_\#} f_i] = \tilde{p}[f_i]$
where f#(x) = ∑i fi(x) is the total number of properties of dag x. Without
going into exactly why fi is weighted as it is on the lefthand side, the idea is the
same as before: we want to adjust βi so that the average number of instances
of property fi according to the model matches the average number of instances
of property fi in dags in the corpus.
If the weights γ1, . . . , γn are not already as good as they can be, solving
equation (27) for βi (for each i) is guaranteed to improve the weights, but it
does not necessarily immediately yield the globally best weights. We can obtain
the globally best weights by iterating. Set γi ← βi, for all i, and solve equation
(27) again. Repeat until the weights no longer change.
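The iteration can be tried out on the two properties of (20), where L(G) is small enough to enumerate. The sketch below rests on one assumption that should be stated loudly: it treats the solution βi of (27) as a multiplicative correction to the old weight γi, as in improved iterative scaling, so that at convergence every βi equals 1 and the weights stop changing. The bisection solver and all names are illustrative.

```python
import math

p_emp = [1/3, 1/6, 1/4, 1/4]              # empirical distribution over x1..x4
feats = [[2, 0, 0, 0], [0, 0, 1, 1]]      # f1 and f2 from (21)
f_sharp = [sum(col) for col in zip(*feats)]   # total property count per dag

def field(gamma):
    """Field defined by weights gamma over the four dags."""
    unnorm = [math.prod(g ** f[x] for g, f in zip(gamma, feats))
              for x in range(len(p_emp))]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def solve_factor(q, f, target, lo=1e-6, hi=1e6, iters=200):
    """Solve sum_x q(x) f(x) b^{f#(x)} = target for b by bisection."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        lhs = sum(qx * fx * mid ** s for qx, fx, s in zip(q, f, f_sharp))
        if lhs < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

gamma = [1.0, 1.0]                        # start from the null field
for _ in range(300):
    q = field(gamma)
    factors = [solve_factor(q, f, sum(p * fx for p, fx in zip(p_emp, f)))
               for f in feats]
    # Assumed update rule: multiply each old weight by its correction factor.
    gamma = [g * b for g, b in zip(gamma, factors)]

print(gamma, field(gamma))
```

Under that reading the weights approach √2 and 3/2, the values used for illustration in (21), and the resulting field reproduces the empirical distribution.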
As with equation (26), solving equation (27) is straightforward if L(G) is
small enough to enumerate, but not if L(G) is large. In that case, we must
use random sampling. We generate a representative mini-corpus and estimate
expectations by counting in the mini-corpus.
5.5 Random Sampling
We have seen that random sampling is necessary both to set the initial weight for
properties under consideration and to adjust all weights after a new property is
adopted. Random sampling involves creating a corpus that is representative of
a given model distribution q(x). To take a very simple example, a fair coin can
be seen as a method for sampling from the distribution q such that q(H) = 1/2,
q(T ) = 1/2. Saying that a corpus is representative is actually not a comment
about the corpus itself but the method by which it was generated: a corpus
representative of distribution q is one generated by a process that samples from
q. Saying that a process M samples from q is to say that the empirical distri-
butions of corpora generated by M converge to q in the limit. For example, if
we flip a fair coin once, the resulting empirical distribution over (H,T ) is either
(1, 0) or (0, 1), not the fair-coin distribution (1/2, 1/2). But as we take larger
and larger corpora, the resulting empirical distributions converge to (1/2, 1/2).
One of the advantages of SCFGs that is lost when we go to random fields is
that there is a transparent relationship between an SCFG defining a distribution
q and a sampler for q. We can sample from the distribution defined by an SCFG
as follows. Consider the grammar (2), repeated here as (28):
(28) 1. S → A A β1 = 1/2
2. S → B β2 = 1/2
3. A → a β3 = 2/3
4. A → b β4 = 1/3
5. B → a a β5 = 1/2
6. B → b b β6 = 1/2
The language of (28) consists of the six trees {x1 = [S [A a] [A a]], x2 = [S [B
a a]], x3 = [S [A b] [A b]], x4 = [S [B b b]], x5 = [S [A a] [A b]], x6 = [S [A b]
[A a]]} with probability distribution q : x1 ↦ 2/9, x2 ↦ 1/4, x3 ↦ 1/18, x4 ↦
1/4, x5 ↦ 1/9, x6 ↦ 1/9.
We sample from q via stochastic derivations. In a stochastic derivation, we
start with the start symbol, S. There are two rules expanding S: S → A A and
S → B. We flip a coin to choose between them, heads for A A, tails for B.
Suppose the coin comes up heads. We expand S to A A, and then expand each
of the A’s in turn. To expand the first A, we consider the two rules A → a and
A → b. To decide between them, we flip a loaded coin that comes up heads
(A → a) 2/3 of the time and tails (A → b) 1/3 of the time. Suppose this coin
also comes up heads. We rewrite the first A as a and go to the second A. We
flip the loaded coin again; suppose it comes up heads again. We rewrite the
second A as a, and the result is tree x1. The chances of throwing three heads
in this manner are 1/2 · 2/3 · 2/3 = 2/9 = q(x1). If we sample repeatedly in
this manner, the proportion of tree x1 in the resulting corpus will converge to
2/9. This is the sense in which stochastic derivations of this sort sample from
the distribution defined by the given SCFG.
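A stochastic derivation of this kind takes only a few lines of code. The following sketch encodes grammar (28) as a dictionary and expands nonterminals recursively, choosing among the rules for a symbol according to their weights; the data layout and function names are choices made for this illustration.

```python
import random
from collections import Counter

# Grammar (28): each nonterminal maps to (right-hand side, probability) pairs.
RULES = {
    "S": [(("A", "A"), 1 / 2), (("B",), 1 / 2)],
    "A": [(("a",), 2 / 3), (("b",), 1 / 3)],
    "B": [(("a", "a"), 1 / 2), (("b", "b"), 1 / 2)],
}

def derive(symbol, rng):
    """Expand symbol by a stochastic derivation; return a bracketed tree."""
    if symbol not in RULES:          # terminal symbol
        return symbol
    rhss, probs = zip(*RULES[symbol])
    rhs = rng.choices(rhss, weights=probs, k=1)[0]   # the "loaded coin"
    children = " ".join(derive(s, rng) for s in rhs)
    return "[" + symbol + " " + children + "]"

rng = random.Random(0)
corpus = [derive("S", rng) for _ in range(20000)]
freq = Counter(corpus)
rel_x1 = freq["[S [A a] [A a]]"] / len(corpus)
print(rel_x1)
```

The relative frequency of x1 in the generated corpus should be close to 2/9, and all six trees of the language should appear.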
When we went from SCFGs to random fields, we lost the transparent con-
nection between the probability distribution defined by the field and a method
for sampling from it. Since weights do not sum to one for rules with the same
lefthand side—indeed, since the properties with which weights are associated
are not even necessarily rule applications—we cannot sample in the same way
as we sample from an SCFG.
There is, however, a method that can be adapted for sampling from the
random field defining a probability distribution over the language of an AV
grammar. This method is the Metropolis-Hastings algorithm. Specifically, in
the case of sets of dags with probability distribution q, we proceed as follows.
Recall that we have a grammar G consisting of feature structures. We also
have a context-free analogue H of G with weights θ, which we use to define
the initial distribution p(x). In addition, we have a field consisting of a set
of properties fi with weights βi. The grammar defines a set of dags Ω = L(G),
and the field plus initial distribution define a probability distribution
$q(x) = \frac{1}{Z} \prod_i \beta_i^{f_i(x)} \, p(x)$ over Ω.
We can sample from the initial distribution p(x) by performing stochastic
derivations using grammar H . The derivations map to dags in L(G) according
to the correspondence between context-free rules and the AV rules of G. It is
possible that some of the derivations will fail—that they will map to inconsistent
dags. Those derivations are simply discarded. That is, the probability that H
assigns to a derivation is actually p˘(x); when we throw away derivations that
map to inconsistent dags, the result is to restrict p˘(x) to consistent dags and
normalize it, so that we end up sampling from p(x).
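The effect of discarding failed derivations can be computed exactly for a small grammar. The sketch below uses the rule weights of grammar (28) for H and, purely as an assumption for illustration, treats a derivation as inconsistent when the two A daughters rewrite to different terminals; that test is a hypothetical stand-in for unification failure, not a description of any particular AV grammar.

```python
from fractions import Fraction
from itertools import product

# Rule weights of the context-free grammar H, taken from (28).
A_RULES = {"a": Fraction(2, 3), "b": Fraction(1, 3)}
B_RULES = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
S_AA, S_B = Fraction(1, 2), Fraction(1, 2)

# Probability that H assigns to each derivation (before discarding any).
p_breve = {}
for left, right in product("ab", repeat=2):
    tree = f"[S [A {left}] [A {right}]]"
    p_breve[tree] = S_AA * A_RULES[left] * A_RULES[right]
for t in "ab":
    p_breve[f"[S [B {t} {t}]]"] = S_B * B_RULES[t]

def consistent(tree):
    """Hypothetical unification check: the two A daughters must agree."""
    return tree not in ("[S [A a] [A b]]", "[S [A b] [A a]]")

# Discard the failed derivations and renormalize.
kept = {t: pr for t, pr in p_breve.items() if consistent(t)}
mass = sum(kept.values())
p = {t: pr / mass for t, pr in kept.items()}

for tree, pr in p.items():
    print(tree, pr)
```

Under these assumptions the surviving derivations carry 7/9 of the original probability mass, and renormalizing gives, for example, probability 2/7 to [S [A a] [A a]].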
In this way, we can sample from L(G), but not in accordance with the field
probability q(x). The essence of the Metropolis-Hastings algorithm is a means
of converting the sampler for p(x) into a sampler for q(x). Suppose we are
generating a corpus, and have generated dags x1, . . . , xn. Now we wish to add
another dag, xn+1, to the corpus. We generate a dag y at random using the
sampler for p(·). Now, instead of simply adding y to the corpus, we flip a loaded
coin that comes up heads with probability
(29) $A(y \mid x_n) = \min\left\{1, \frac{q(y)\,p(x_n)}{q(x_n)\,p(y)}\right\}$
If the coin comes up heads, we do include y in the corpus, that is, xn+1 = y.
But if the coin comes up tails, we throw y away and make a copy of xn instead,
that is, xn+1 = xn.
The acceptance probability A(y|xn) reduces in our case to a particularly simple
form. If q(y)p(xn) ≥ q(xn)p(y), then obviously A(y|xn) = 1. Otherwise,
writing F(x) for the “field weight” $\prod_i \beta_i^{f_i(x)}$, we have:
$A(y \mid x_n) = \frac{Z^{-1} F(y)\,p(y)\,p(x_n)}{Z^{-1} F(x_n)\,p(x_n)\,p(y)} = \frac{F(y)}{F(x_n)}$
It can be shown that the result of generating a new dag with probability
p(·) and accepting it with probability A(·|xn) yields a sampler for q(·) (see e.g.
Winkler [7]). The final “acceptance” step intuitively serves the role of “punish-
ing” dags that the p-sampler proposes more often than a q-sampler would, and
shifting their probability to dags that the p-sampler would propose less often
than a q-sampler would.
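The procedure is short enough to sketch in code. The example below runs the sampler on the four dags of the running example, using the field weights of (21) and an initial distribution p whose values are invented for illustration. Because proposals are drawn from p, only the ratio of field weights is needed at the acceptance step, so the normalizing constant Z is never computed by the sampler itself.

```python
import random
from collections import Counter

DAGS = ["x1", "x2", "x3", "x4"]
# Hypothetical initial distribution p over the dags (illustrative values).
P_INIT = {"x1": 0.4, "x2": 0.2, "x3": 0.2, "x4": 0.2}
# Property counts and weights from (21).
F1 = {"x1": 2, "x2": 0, "x3": 0, "x4": 0}
F2 = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
BETA1, BETA2 = 2 ** 0.5, 1.5

def field_weight(x):
    """F(x): product of the property weights raised to the property counts."""
    return BETA1 ** F1[x] * BETA2 ** F2[x]

def propose(rng):
    """Draw a dag from the initial distribution p."""
    return rng.choices(DAGS, weights=[P_INIT[d] for d in DAGS], k=1)[0]

def metropolis_hastings(n, rng):
    x = propose(rng)
    corpus = []
    for _ in range(n):
        y = propose(rng)
        accept = min(1.0, field_weight(y) / field_weight(x))
        if rng.random() < accept:
            x = y                    # heads: the new dag enters the corpus
        corpus.append(x)             # tails: the previous dag is copied
    return corpus

rng = random.Random(0)
corpus = metropolis_hastings(50000, rng)
freq = {d: c / len(corpus) for d, c in Counter(corpus).items()}

# Target distribution q(x) = F(x) p(x) / Z, computed here only for comparison.
z = sum(field_weight(d) * P_INIT[d] for d in DAGS)
target = {d: field_weight(d) * P_INIT[d] / z for d in DAGS}
print(freq, target)
```

The relative frequencies in the generated corpus should approach the target values as the corpus grows.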
In somewhat more detail, if we think of the corpus x1, x2, . . . as a random
walk through the space L(G), the Metropolis-Hastings algorithm works because
it forces the random walk to spend time in a region R proportional to the
probability of R. This is accomplished, intuitively, by preservation of what is
known as detailed balance. Detailed balance requires that the probability of
making a transition from dag x to dag y in the course of the random walk
should balance the probability of making a transition from dag y to dag x.
Let q(x) be, as always, the model probability that we wish to sample from
and let q(y|x) be the transition probability—the probability of the next dag in
the corpus being y if the previous dag is x. In our case, q(y|x) (for y ≠ x) is
the probability that we generate y at random, and then also accept it: q(y|x) =
p(y)A(y|x). Define q(x, y) (for x ≠ y) to be the joint probability that x is
the previous dag and y is the next dag; that is, q(x, y) = q(x)q(y|x). Detailed
balance requires that q(x, y) = q(y, x). If detailed balance is preserved, it can
be shown that the empirical distribution of the corpus generated by the random
walk converges to q(·), and that the expectation of a function f taken with
respect to the empirical distributions converges to q[f ].
We can see that the transition probability we have assumed does indeed
preserve detailed balance, as follows. Let x be the last-generated tree and y the
new tree, and suppose that q(y)p(x) > q(x)p(y). Then:
$q(y \mid x) = p(y)$
$q(x \mid y) = p(x)\,\frac{q(x)\,p(y)}{q(y)\,p(x)} = \frac{q(x)}{q(y)}\,p(y)$
$q(x, y) = q(x)\,p(y)$
$q(y, x) = q(y)\,\frac{q(x)}{q(y)}\,p(y) = q(x)\,p(y)$
That is, q(x, y) = q(y, x) and detailed balance is confirmed. The remaining
cases q(y)p(x) < q(x)p(y) and q(y)p(x) = q(x)p(y) are similar and are left as
an exercise for the reader.
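Detailed balance can also be checked numerically on a small example. The sketch below uses exact rational arithmetic over the four dags of the running example, with the field weights of (21) (note that √2 · √2 = 2 for x1) and an initial distribution p whose values are invented for illustration. It is a sanity check on one instance, not a substitute for the argument above.

```python
from fractions import Fraction as Fr
from itertools import permutations

# Field weights F(x) from (21) and a hypothetical initial distribution p.
F = {"x1": Fr(2), "x2": Fr(1), "x3": Fr(3, 2), "x4": Fr(3, 2)}
p = {"x1": Fr(2, 5), "x2": Fr(1, 5), "x3": Fr(1, 5), "x4": Fr(1, 5)}

# Model distribution q(x) = F(x) p(x) / Z.
Z = sum(F[x] * p[x] for x in F)
q = {x: F[x] * p[x] / Z for x in F}

def accept(y, x):
    """Acceptance probability (29) for moving from x to y."""
    return min(Fr(1), q[y] * p[x] / (q[x] * p[y]))

def joint(x, y):
    """q(x, y) = q(x) q(y|x), with q(y|x) = p(y) A(y|x) for y != x."""
    return q[x] * p[y] * accept(y, x)

balanced = all(joint(x, y) == joint(y, x) for x, y in permutations(F, 2))
print(balanced)
```

Every ordered pair of distinct dags satisfies q(x, y) = q(y, x) exactly in this instance.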
6 Final Remarks
In summary, we cannot simply transplant CF methods to the AV grammar case.
In particular, the ERF method yields correct weights only for SCFGs, not for
AV grammars. We can define a probabilistic version of AV grammars with a
correct weight-selection method by going to random fields. Property selection
and weight adjustment can be accomplished using the DDL algorithms. In
property selection, we need to use random sampling to find the initial weight
for a candidate property, and in weight adjustment we need to use random
sampling to solve the weight equation. The random sampling method that DDL
used is not appropriate for sets of dags, but we can use the Metropolis-Hastings
method.
As a closing note, it should be pointed out explicitly that the random field
techniques described here can also be profitably applied to context-free gram-
mars. As Stanley Peters nicely put it, there is a distinction between possibilis-
tic and probabilistic context-sensitivity. Even if the language described by the
grammar of interest—that is, the set of possible trees—is context-free, there
may well be context-sensitive statistical dependencies. Random fields can be
readily applied to capture such statistical dependencies whether or not L(G) is
context-sensitive.
References
[1] Chris Brew. Stochastic HPSG. In Proceedings of EACL-95, 1995.
[2] Andreas Eisele. Towards probabilistic extensions of constraint-based gram-
mars. Technical Report Deliverable R1.2.B, DYANA-2, 1994.
[3] Kevin Mark, Michael Miller, Ulf Grenander, and Steve Abney. Parameter
estimation for constrained context-free language models. In Proceedings of
the Fifth DARPA Workshop on Speech and Natural Language, San Mateo, CA,
1992. Morgan Kaufmann.
[4] M.I. Miller and J.A. O’Sullivan. Entropies, combinatorics and probabilities
of context-free branching processes. Technical report ESSRL-90-16, Elec-
tronic Systems and Signals Research Laboratory, Washington University,
1990.
[5] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing
features of random fields. Technical Report CMU-CS-95-144, CMU, 1995.
[6] Stefan Riezler. Quantitative constraint logic programming for weighted
grammar applications. Talk given at LACL, September 1996.
[7] Gerhard Winkler. Image Analysis, Random Fields and Dynamic Monte
Carlo Methods. Springer, 1995.