Maximum likelihood from incomplet...
Maximum Likelihood from Incomplete Data via the EM Algorithm A. P. Dempster N. M. Laird D. B. Rubin Journal of the Royal Statistical Society. Series B (Methodological), Vol. 39, No. 1. (1977), pp. 1-38. Stable URL: http://links.jstor.org/sici?sici=0035-9246%281977%2939%3A1%3C1%3AMLFIDV%3E2.0.CO%3B2-Z Journal of the Royal Statistical Society. Series B (Methodological) is currently published by Royal Statistical Society. Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at http://www.jstor.org/about/terms.html. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at http://www.jstor.org/journals/rss.html. Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is an independent not-for-profit organization dedicated to and preserving a digital archive of scholarly journals. For more information regarding JSTOR, please contact email@example.com. http://www.jstor.org Mon Apr 2 09:45:50 2007
Maximum Likelihood from Incomplete Data via the EM Algorithm By A. P. DEMPSTER, N. M. LAIRD and D. B. RDIN Harvard University and Educational Testing Service [Read before the ROYAL STATISTICAL at a meeting organized by the RESEARCH SOCIETY SECTION on Wednesday, December 8th, 1976, Professor S. D. SILVEY in the Chair] A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis. Keywords : MAXIMUM LIKELIHOOD INCOMPLETE DATA EM ALGORITHM POSTERIOR MODE 1. INTRODUCTION THIS paper presents a general approach to iterative computation of maximum-likelihood estimates when the observations can be viewed as incomplete data. Since each iteration of the algorithm consists of an expectation step followed by a maximization step we call it the EM algorithm. The EM process is remarkable in part because of the simplicity and generality of the associated theory, and in part because of the wide range of examples which fall under its umbrella. When the underlying complete data come from an exponential family whose maximum-likelihood estimates are easily computed, then each maximization step of an EM algorithm is likewise easily computed. The term "incomplete data" in its general form implies the existence of two sample spaces %Y and X and a many-one mapping f r o m 3 to Y. The observed data y are a realization from CY. The corresponding x in X is not observed directly, but only indirectly through y. More specifically, we assume there is a mapping x + y(x) from X to Y, and that x is known only to lie in X(y), the subset of X determined by the equation y = y(x), where y is the observed data. We refer to x as the complete data even though in certain examples x includes what are traditionally called parameters. We postulate a family of sampling densities f(x I +) depending on parameters and derive its corresponding family of sampling densities g(y[+). complete-data specification f(...1 ...) is related to the incomplete-data specification g( ...IThe ...) by (1.1) The EM algorithm is directed at finding a value of + which maximizes g(y 1 +) g' iven an observed y, but it does so by making essential use of the associated family f(xl+). Notice that given the incomplete-data specification g(y1 +), there are many possible complete-data specificationsf (x)+) that will generate g(y 1 +). Sometimes a natural choice will be obvious, at other times there may be several different ways of defining the associated f(xl+). Each iteration of the EM algorithm involves two steps which we call the expectation step (E-step) and the maximization step (M-step). The precise definitions of these steps, and their associated heuristic interpretations, are given in Section 2 for successively more general types of models. Here we shall present only a simple numerical example to give the flavour of the method.
2 DEMPSTER Maximum Likelihood from Incomplete Data et al. - [No. 1, Rao (1965, pp. 368-369) presents data in which 197 animals are distributed multinomially into four categories, so that the observed data consist of A genetic model for the population specifies cell probabilities (4+in, &(l-n), &(I -n), in) for some n with 06 n 1. Thus Rao uses the parameter 0 where n = (1- 0), and carries through one step of the familiar Fisher-scoring procedure for maximizing g(y / ( I - 0),) given the observed y. To illustrate the EM algorithm, we represent y as incomplete data from a five-category multinomial population where the cell probabilities are (i, an, i(l -n), &(l -n), in), the idea being to split the first of the original four categories into two categories. Thus the complete data consist of X = (XI, XZ, X3, X4, ~ 5 ) where Y l = XI+ x2, YZ = x3, Y3 = x4, Y4 = x5, and the complete data specification is (XI+ x2 +X3 +X4 +x5) !(~)ZI (in).. (a -iTp ($ -4.1~~ (in)Xs. (1.3) f(x14 = xl! x,! x3! x4! x,! Note that the integral in (1.1) consists in this case of summing (1.3) over the (xl,xJ pairs (0,125), (1,124), ...,(125, O), while simply substituting (18,20,34) for (x3, x,, x,). To define the EM algorithm we show how to find n(p+l)from n(p), where n(p)denotes the value of n after p iterations, for p = 0,1,2, .... As stated above, two steps are required. The expectation step estimates the sufficient statistics of the complete data x, given the observed data y. In our case, (x3, x4, x,) are known to be (18,20,34) so that the only sufficient statistics that have to be estimated are xl and x, where x,+x, = y1 = 125. Estimating x1 and x, using the current estimate of n leads to ~ $ 1 3 )= 125- 8 and xip) = 125- g+&n(P) g +tn(p)'in(p) The maximization step then takes the estimated complete data (x:p),xip), 18,20,34) and estimates n by maximum likelihood as though the estimated complete data were the observed data, thus yielding The EM algorithm for this example is defined by cycling back and forth between (1.4) and (1.5). Starting from an initial value of do) = 0.5, the algorithm moved for eight steps as displayed in Table 1. By substituting xip) from equation (1.4) into equation (IS), and letting n* = n(p) = n(p+l)we can explicitly solve a quadratic equation for the maximum-likelihood estimate of n : The second column in Table 1 gives the deviation n(p)-n*, and the third column gives the ratio of successive deviations. The ratios are essentially constant for p 2 3. The general theory of Section 3 implies the type of convergence displayed in this example.
19771 DEMPSTER Maximum Likelihood from Incomplete Data 3 et al. - The EM algorithm has been proposed many times in special circumstances. For example, Hartley (1958) gave three multinomial examples similar to our illustrative example. Other examples to be reviewed in Section 4 include methods for handling missing values in normal models, procedures appropriate for arbitrarily censored and truncated data, and estimation TABLE1 The EM aIgorithm in a simple case P T f 9) T(") -T* (Tf9+1) -T*)+(T'") -T*) methods for finite mixtures of parametric families, variance components and hyperparameters in Bayesian prior distributions of parameters. In addition, the EM algorithm corresponds to certain robust estimation techniques based on iteratively reweighted least squares. We anticipate that recognition of the EM algorithm at its natural level of generality will lead to new and useful examples, possibly including the general approach to truncated data proposed in Section 4.2 and the factor-analysis algorithms proposed in Section 4.7. Some of the theory underlying the EM algorithm was presented by Orchard and Woodbury (1972), and by Sundberg (1976), and some has remained buried in the literature of special examples, notably in Baum et al. (1970). After defining the algorithm in Section 2, we demonstrate in Section 3 the key results which assert that successive iterations always increase the likelihood, and that convergence implies a stationary point of the likelihood. We give sufficient conditions for convergence and also here a general description of the rate of con- vergence of the algorithm close to a stationary point. Although our discussion is almost entirely within the maximum-likelihood framework, the EM technique and theory can be equally easily applied to finding the mode of the posterior distribution in a Bayesian framework. The extension required for this application appears at the ends of Sections 2 and 3. 2. DEFINITIONS OF THE EM ALGORITHM We now define the EM algorithm, starting with cases that have strong restrictions on the complete-data specification f(x1 +), then presenting more general definitions applicable when these restrictions are partially removed in two stages. Although the theory of Section 3 applies at the most general level, the simplicity of description and computational procedure, and thus the appeal and usefulness, of the EM algorithm are greater at the more restricted levels. Suppose first that f(x 1 +) has the regular exponential-family form where + denotes a 1 x r vector parameter, t(x) denotes a 1x r vector of complete-data sufficient statistics and the superscript T denotes matrix transppse. The term regular means here that + is restricted only to an r-dimensional convex set !2 such that (2.1) defines a density for all + in Q. The parameterization + in (2.1) is thus unique up to an arbitrary non-singular r x r linear transformation, as is the corresponding choice of t(x). Such parameters are often called
4 DEMPSTER et al. - Maximum Likelihood from Incomplete Data [No. 1, natural parameters, although in familiar examples the conventional parameters are often non-linear functions of +. For example, in binomial sampling, the conventional parameter .rr and the natural parameter q5 are related by the formula q5 = log.rr/(l -r).In Section 2, we adhere to the natural parameter representation for + when dealing with exponential families, while in Section 4 we mainly choose conventional representations. We note that in (2.1) the sample space S over which f(xl+) 0 is the same for all + in i2. We now present a simple characterization of the EM algorithm which can usually be applied when (2.1) holds. Suppose that +(p) denotes the current value of + after p cycles of the algorithm. The next cycle can be described in two steps, as follows: E-step: Estimate the complete-data sufficient statistics t(x) by finding M-step: Determine +(pfl) as the solution of the equations Equations (2.3) are the familiar form of the likelihood equations for maximum-likelihood estimation given data from a regular exponential family. That is, if we were to suppose that t(p) represents the sufficient statistics computed from an observed x drawn from (2.1), then equations (2.3) usually define the maximum-likelihood estimator of +. Note that for given x, maximizing log f(x I +) = is equivalent to maximizing -loga(+) +log b(x) ++ t ( ~ ) ~ which depends on x only through t(x). Hence it is easily seen that equations (2.3) define the usual condition for maximizing -logs(+)++t(p)Twhether or not t(p) computed from (2.2) represents a value of t(x) associated with any x in S. In the example of Section 1, the compo- nents of x are integer-valued, while their expectations at each step usually are not. A difficulty with the M-step is that equations (2.3) are not always solvable for + in i2. In such cases, the maximizing value of + lies on the boundary of i2 and a more general definition, as given below, must be used. However, if equations (2.3) can be solved for + in i2, then the solution is unique due to the well-known convexity property of the log-likelihood for regular exponential families. Before proceeding to less restricted cases, we digress to explain why repeated application of the E- and M-steps leads ultimately to the value +* of + that maximizes L(+) = 1% g(y I +I�� (2.4) where g(y 1 +) is defined from (1.1) and (2.1). Formal convergence properties of the EM algorithm are given in Section 3 in the general case. First, we introduce notation for the conditional density of x given y and +, namely, so that (2.4) can be written in the useful form For exponential families, we note that k(x 1 Y, +) = b(x) exp (+t(~)~)/a(+ l Y), where n