Sign up & Download
Sign in

Bayesian measures of model complexity and fit

by David J Spiegelhalter, Nicola G Best, Bradley P Carlin, Angelika Van Der Linde
Journal of the Royal Statistical Society - Series B: Statistical Methodology ()

Abstract

We consider the problem of comparing complex hierarchical models in which the number of parameters is not clearly defined. Using an information theoretic argument we derive a measure pD for the effective number of parameters in a model as the difference between the posterior mean of the deviance and the deviance at the posterior means of the parameters of interest. In general pD approximately corresponds to the trace of the product of Fisher's information and the posterior covariance, which in normal models is the trace of the 'hat' matrix projecting observations onto fitted values. Its properties in exponential families are explored. The posterior mean deviance is suggested as a Bayesian measure of fit or adequacy, and the contributions of individual observations to the fit and complexity can give rise to a diagnostic plot of deviance residuals against leverages. Adding pD to the posterior mean deviance gives a deviance information criterion for comparing models, which is related to other information criteria and has an approximate decision theoretic justification. The procedure is illustrated in some examples, and comparisons are drawn with alternative Bayesian and classical proposals. Throughout it is emphasized that the quantities required are trivial to compute in a Markov chain Monte Carlo analysis.

Cite this document (BETA)

Available from doi.wiley.com
Page 1
hidden

Bayesian measures of model comple...

J. R. Statist. Soc. B (2002) 64, Part 4, pp. 583���639 Bayesian measures of model complexity and fit David J. Spiegelhalter, Medical Research Council Biostatistics Unit, Cambridge, UK Nicola G. Best, Imperial College School of Medicine, London, UK Bradley P. Carlin University of Minnesota, Minneapolis, USA and Angelika van der Linde University of Bremen, Germany [Read before The Royal Statistical Society at a meeting organized by the Research Section on Wednesday, March 13th, 2002, Professor D. Firth in the Chair ] Summary. We consider the problem of comparing complex hierarchical models in which the number of parameters is not clearly defined. Using an information theoretic argument we derive a measure pD for the effective number of parameters in a model as the difference between the posterior mean of the deviance and the deviance at the posterior means of the parameters of interest. In general pD approximately corresponds to the trace of the product of Fisher���s information and the posterior covariance, which in normal models is the trace of the ���hat��� matrix projecting observations onto fitted values. Its properties in exponential families are explored. The posterior mean deviance is suggested as a Bayesian measure of fit or adequacy, and the contributions of individual observations to the fit and complexity can give rise to a diagnostic plot of deviance residuals against leverages. Adding pD to the posterior mean deviance gives a deviance information criterion for comparing models, which is related to other information criteria and has an approximate decision theoretic justification. The procedure is illustrated in some examples, and comparisons are drawn with alternative Bayesian and classical proposals. Throughout it is emphasized that the quantities required are trivial to compute in a Markov chain Monte Carlo analysis. Keywords: Bayesian model comparison Decision theory Deviance information criterion Effective number of parameters Hierarchical models Information theory Leverage Markov chain Monte Carlo methods Model dimension 1. Introduction The development of Markov chain Monte Carlo (MCMC) methods has made it possible to fit increasingly large classes of models with the aim of exploring real world complexities of data (Gilks et al., 1996). This ability naturally leads us to wish to compare alternative model formulations with the aim of identifying a class of succinct models which appear to describe the information in the data adequately: for example, we might ask whether we need to incorporate Address for correspondence: David J. Spiegelhalter, Medical Research Council Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge, CB2 2SR, UK. E-mail: david.spiegelhalter@mrc-bsu.cam.ac.uk ��� 2002 Royal Statistical Society 1369���7412/02/64583
Page 2
hidden
584 D. J. Spiegelhalter, N. G. Best, B. P. Carlin and A. van der Linde a random effect to allow for overdispersion, what distributional forms to assume for responses and random effects, and so on. Within the classical modelling framework, model comparison generally takes place by defin- ing a measure of fit, typically a deviance statistic, and complexity, the number of free parameters in the model. Since increasing complexity is accompanied by a better fit, models are compared by trading off these two quantities and, following early work of Akaike (1973), proposals are often formally based on minimizing a measure of expected loss on a future replicate data set: see, for example, Efron (1986), Ripley (1996) and Burnham and Anderson (1998). A model comparison using the Bayesian information criterion also requires the specification of the num- ber of parameters in each model (Kass and Raftery, 1995), but in complex hierarchical models parameters may outnumber observations and these methods clearly cannot be directly applied (Gelfand and Dey, 1994). The most ambitious attempts to tackle this problem appear in the smoothing and neural network literature (Wahba, 1990 Moody, 1992 MacKay, 1995 Ripley, 1996). This paper suggests Bayesian measures of complexity and fit that can be combined to compare models of arbitrary structure. In the next section we use an information theoretic argument to motivate a complexity mea- sure pD for the effective number of parameters in a model, as the difference between the posterior mean of the deviance and the deviance at the posterior estimates of the parameters of inter- est. This quantity can be trivially obtained from an MCMC analysis and algebraic forms and approximations are unnecessary for its use. We nevertheless investigate some of its formal prop- erties in the following three sections: Section 3 shows that pD is approximately the trace of the product of Fisher���s information and the posterior covariance matrix, whereas in Section 4 we show that for normal models pD corresponds to the trace of the ���hat��� matrix projecting observa- tions onto fitted values and we illustrate its form for various hierarchical models. Its properties in exponential families are explored in Section 5. The posterior mean deviance �� D can be taken as a Bayesian measure of fit or ���adequacy���, and Section 6 shows how in exponential family models an observation���s contributions to �� D and pD can be used as residual and leverage diagnostics respectively. In Section 7 we tentatively suggest that the adequacy �� D and complexity pD may be added to form a deviance information criterion DIC which may be used for comparing models. We describe how this parallels the development of non-Bayesian information criteria and provide a somewhat heuristic decision theoretic justification. In Section 8 we illustrate the use of this technique on some reason- ably complex examples. Finally, Section 9draws some conclusions concerning these proposed techniques. 2. The complexity of a Bayesian model 2.1. ���Focused��� full probability models Parametric statistical modelling of data y involves the specification of a probability model p.y|��/, �� ��� ��. For a Bayesian ���full��� probability model, we also specify a prior distribution p.��/ which may give rise to a marginal distribution p.y/ = �� p.y|��/ p.��/ d��: (1) Particular choices of p.y|��/ and p.��/ will be termed a model ���focused��� on ��. Note that we might further parameterize our prior with unknown ���hyperparameters��� �� to create a hierarchical model, so that the full probability model factorizes as
Page 3
hidden
Model Complexity and Fit 585 p.y �� ��/ = p.y ��/ p.��|��/ p.��/: Then, depending on the parameters in focus, the model may compose the likelihood p.y|��/ and prior p.��/ = �� p.��|��/ p.��/ d�� or the likelihood p.y|��/ = �� p.y|��/ p.��|��/ d�� and prior p.��/. Both these models lead to the same marginal distribution (1) but can be consid- ered as having different numbers of parameters. A consequence is that in hierarchical modelling we cannot uniquely define a ���likelihood��� or ���model complexity��� without specifying the level of the hierarchy that is the focus of the modelling exercise (Gelfand and Trevisani, 2002). In fact, by focusing our models on a particular set of parameters ��, we essentially reduce all models to non-hierarchical structures. For example, consider an unbalanced random-effects one-way analysis of variance (ANOVA) focused on the group means: yi|��i ��� N.��i ��i���1/ ��i ��� N.�� �����1/ i = 1 ::: p: (2) This model could also be focused on the overall mean �� to give yi|�� ��� N.�� ��i���1 + �����1/ in which case it could reasonably be considered as having a different complexity. It is natural to wish to measure the complexity of a focused model, both in its own right, say to assess the degrees of freedom of estimators, and as a contribution to model choice: for example, criteria such as BIC (Schwarz, 1978), AIC (Akaike, 1973), TIC (Takeuchi, 1976) and NIC (Murata et al., 1994) all trade off model fit against a measure of the effective number of parameters in the model. However, the foregoing discussion suggests that such measures of com- plexity may not be unique and will depend on the number of parameters in focus. Furthermore, the inclusion of a prior distribution induces a dependence between parameters that is likely to reduce the effective dimensionality, although the degree of reduction may depend on the data that are available. Heuristically, complexity reflects the ���difficulty in estimation��� and hence it seems reasonable that a measure of complexity may depend on both the prior information concerning the parameters in focus and the specific data that are observed. 2.2. Is there a true model? We follow Box (1976) in believing that ���all models are wrong, but some are useful���. However, it can be useful to posit a ���true��� distribution pt.Y/ of unobserved future data Y since, for any focused model, this defines a ���pseudotrue��� parameter value ��t (Sawa, 1978) which specifies a likelihood p.Y|��t/ that minimizes the Kullback���Leibler distance Et[log{pt.Y/}=p.Y|��t/] from pt.Y/. Having observed data y, under reasonably broad conditions (Berk, 1966 Bunke and Milhaud, 1998) p.��|y/ converges to ��t as information on the components of �� increases. Thus Bayesian analysis implicitly relies on p.Y|��t/ being a reasonable approximation to pt.Y/, and we shall indicate where we make use of this ���good model��� assumption.
Page 4
hidden
586 D. J. Spiegelhalter, N. G. Best, B. P. Carlin and A. van der Linde 2.3. True and estimated residual information The residual information in data y conditional on �� may be defined (up to a multiplicative constant) as ���2 log{p.y|��/} (Kullback and Leibler, 1951 Burnham and Anderson, 1998) and can be interpreted as a measure of ���surprise��� (Good, 1956), logarithmic penalty (Bernardo, 1979) or uncertainty. Suppose that we have an estimator ��.y/ �� of the pseudotrue parameter ��t. Then the excess of the true over the estimated residual information will be denoted d��{y ��t ��.y/} �� = ���2 log{p.y|��t/} + 2 log[p{y| ��.y/}]: �� (3) This can be thought of as the reduction in surprise or uncertainty due to estimation, or alter- natively the degree of ���overfitting��� due to ��.y/ �� adapting to the data y. We now argue that d�� may form the basis for both classical and Bayesian measures of model dimensionality, with each approach differing in how it deals with the unknown true parameters in d��. 2.4. Classical measures of model dimensionality In a non-Bayesian likelihood-based context, we may take ��.y/ �� to be the maximum likelihood estimator ��.y/, �� expand 2 log{p.y|��t/} around 2 log[p{y| ��.y/}], �� take expectations with respect to the unknown true sampling distribution pt.Y/ and hence show (Ripley, 1996) (page 34) that Et[d��{Y ��t ��.Y/}] �� ��� p�� = tr.KJ���1/ (4) where J = ���Et @2 log{p.Y|��t/} @��2 K = vart @ log{p.Y|��t/} @�� : (5) This is the measure of complexity that is used in TIC (Takeuchi, 1976). Burnham and Anderson (1998) (page 244) pointed out that p�� = tr.J��/ (6) where �� = J���1KJ���1 isthefamiliar���sandwich���approximationtothevariance���covariancematrix of the ��.y/ �� (Huber, 1967). If pt.y/ = p.y|��t/, i.e. one of the models is true, then K = J and p�� = p, the number of independent parameters in ��. For example, in a fixed effect ANOVA model yi|��i ��� N.��i ��i���1/ i = 1 ::: p with ��i���1s known, d��{y ��t ��.y/} �� = ��� i ��i.yi ��� ��i/2t whose expectation under pt.Y/ is p�� = ��i ��i Et.Yi �����t/2. If the model is true, Et.Yi �����t/2 = ��i���1 and so p�� = p. Ripley (1996) (page 140) showed how this procedure may be extended to ���regularized��� models in which a specified prior term p.��/ is introduced to form a penalized log-likelihood. Replacing log.p/ by log{p.y|��/} + log{p.��/} in equations (5) yields a more general definition of p�� that
Page 5
hidden
Model Complexity and Fit 587 was derived by Moody (1992) and termed the ���effective number of parameters���. This is the measure of dimensionality that is used in NIC (Murata et al., 1994): the estimation of p�� is generally not straightforward (Ripley, 1996). In the random-effects ANOVA example with ��i ��� N.�� �����1/ �� and �� known, let ��i = ��i=.��i + ��/ be the intraclass correlation coefficient in the ith group. We then obtain p �� = ��� i ��i��i Et.Yi ��� ��t/2 (7) which becomes p�� = ��� i ��i (8) if the likelihood is true. 2.5. A Bayesian measure of model complexity From a Bayesian perspective, the unknown ��t may be replaced by a random variable ��. Then d��{y �� ��.y/} �� can be estimated by its posterior expectation with respect to p.��|y/, denoted pD{y �� ��.y/} �� = E��|y[d��{y �� ��.y/}] �� = E��|y[���2 log{p.y|��/}] + 2 log[p{y| ��.y/}]: �� (9) pD{y �� ��.y/} �� isourproposalastheeffectivenumberofparameterswithrespecttoamodelwith focus ��: we shall usually drop the arguments {y �� ��.y/} �� from the notation. In our examples we shall generally take ��.y/ �� = E.��|y/ = ��, �� the posterior mean of the parameters. However, we note that it is not strictly necessary to use the posterior mean as an estimator of either d�� or ��, and the mode or median could be justified (Section 2.6). Taking f.y/ to be some fully specified standardizing term that is a function of the data alone, pD may be written as pD = D.��/ ��� D. ��/ �� (10) where D.��/ = ���2 log{p.y|��/} + 2 log{f.y/}: We shall term D.��/ the ���Bayesian deviance��� in general and, more specifically, for members of the exponential family with E.Y/ = ��.��/ we shall use the saturated deviance D.��/ obtained by setting f.y/ = p{y|��.��/ = y}: see Section 8.1. Equation (10) shows that pD can be considered as a ���mean deviance minus the deviance of the means���. A referee has pointed out the related argument used by Meng and Rubin (1992), who showed that such a difference, between the average of log-likelihood ratios and the likelihood ratio evaluated at the average (over multiple imputations) of the parameters, is the key quantity in estimating the degrees of freedom of a test. For example, in the random-effects ANOVA (2) with �� and �� known, D.��/ = ��� i ��i.yi ��� ��i/2
Page 6
hidden
588 D. J. Spiegelhalter, N. G. Best, B. P. Carlin and A. van der Linde which is ���2 log(likelihood) standardized by the term ���2 log{f.y/} = ��i log.2��=��i/ obtained from setting ��i = yi. Now ��i|y ��� N{��iyi + .1 ��� ��i/�� ��i��i���1} and hence it can be shown that the posterior distribution of D.��/ has the form D.��/ ��� ��� ��i ��2{1 .yi ��� ��/2.1 ��� ��i/��} where ��2.a b/ is a non-central ��2-distribution with mean a + b. Thus, since ��i�� = .1 ��� ��i/��i, we have D.��/ = ��� ��i + ��� ��i.1 ��� ��i/2.yi ��� ��/2 D. ��/ �� = ��� ��i.1 ��� ��i/2.yi ��� ��/2 and so pD = ��� i ��i = ��� i ��i ��i + �� : (11) The effective number of parameters is therefore the sum of the intraclass correlation coefficients, which essentially measures the sum of the ratios of the precision in the likelihood to the precision in the posterior. This exactly matches Moody���s approach (8) when the model is true. If �� is unknown and given a uniform hyperprior we obtain a posterior distribution �� ��� N{�� y .������i/���1}, where �� y = ����iyi=����i. It is straightforward to show that D.��/ = ��� ��i + �� ��� ��i.1 ��� ��i/.yi ��� �� y/ 2 + ��� ��i.1 ��� ��i/= ��� ��i D. ��/ �� = �� ��� ��i.1 ��� ��i/.yi ��� �� y/ 2 and so pD = ����i +����i.1��� ��i/=����i. If the groups are independent, �� = 0 ��i = 1 and pD = p. If the groups all have the same mean, �� ��� ��� ��i ��� 0 and pD ��� 1. If all group precisions are equal, pD = 1 + .p ��� 1/�� as obtained by Hodges and Sargent (2001). 2.6. Some observations on pD (a) Equation (10) may be rewritten as D.��/ = D. ��/ �� + pD (12) whichcanbeinterpretedasaclassical���plug-in���measureoffitplusameasureofcomplexity. Thus our Bayesian measure of fit, D.��/, could perhaps be better considered as a measure of ���adequacy���, and we shall use these terms interchangeably. However, in Section 7.3 we shall suggest that an additional penalty for complexity may be reasonable when making model comparisons. (b) Simple use of the Bayes theorem reveals the expression pD = E��|y ���2 log p.��|y/ p.��/ + 2 log p. ��|y/ �� p. ��/ �� which can be interpreted as (minus twice) the posterior estimate of the gain in information provided by the data about ��, minus the plug-in estimate of the gain in information.
Page 7
hidden
Model Complexity and Fit 589 (c) It is reasonable that the effective number of parameters in a model might depend on the data, the choice of focus �� and the prior information (Section 2.1). Less attractive, perhaps, is that pD may also depend on the choice of estimator ��.y/, �� since this can produce a lack of invariance of pD to apparently innocuous transformations, such as making inferences on logits instead of probabilities in Bernoulli trials. Our usual choice of the posterior mean is largely based on the subsequent ability to investigate approximate forms for pD (Section 3), and the positivity properties described below. A choice of, say, posterior medians would produce a measure of model complexity that was invariant to univariate 1���1 transformations, and we explore this possibility in Section 5. (d) It follows from equation (10) and Jensen���s inequality that, when using the posterior mean as an estimator ��.y/ �� pD 0 for any likelihood that is log-concave in ��, with 0 being approached for a degenerate prior on ��. Non-log-concave likelihoods can, however, give rise to a negative pD in certain circumstances. For example, consider a single observation from a Cauchy distribution with deviance D.��/ = 2 log{1 + .y ��� ��/2}, with a discrete prior assigning probability 1/11 to �� = 0 and 10/11 to �� = 3. If we observe y = 0, then the posterior probabilities are changed to 0.5 and 0.5, and so �� �� = 1:5. Thus pD = D.��/ ��� D. ��/ �� = log.10/ ��� 2 log.13=4/ = log.160=169/ 0. Our experience has been that negative pDs indicate substantial conflict between the prior and data, or where the pos- terior mean is a poor estimator (such as a symmetric bimodal distribution). (e) The posterior distribution that is used in obtaining pD conditions on the truth of the model, and hence pD may only be considered an appropriate measure of the complexity of a model that reasonably describes the data. This is reflected in the finding that pD in the simple ANOVA example (11) will not necessarily be approximately equivalent to the classical p�� (7) if the assumptions of the model are substantially inaccurate. This good model assumption (Section 2.2) is further considered when we come to comparisons of models (Section 7.3). (f) Providedthat D.��/ isavailableinclosedform, pD maybeeasilycalculatedafteranMCMC run by taking the sample mean of the simulated values of D.��/, minus the plug-in estimate of the deviance using the sample means of the simulated values of ��. No ���small sample��� adjustment is necessary. This ease of computation should be contrasted with the frequent difficulty within the classical framework with deriving the functional form of the measure of dimensionality and its subsequent estimation. (g) Since the complexity depends on the focus, a decision must be made whether nuisance parameters, e.g. variances, are to be included in �� or integrated out before specifying the model p.y|��/. However, such a removal of nuisance parameters may create computational difficulties. pD has been defined and is trivially computable by using MCMC methods, and so strictly speaking there is no need to explore exact forms or approximations. However, to provide insight into the behaviour of pD, the following three sections consider the form of pD in different situations and draw parallels with alternative suggestions: note that we are primarily concerned with the ���preasymptotic��� situation in which prior opinion is still influential and the likelihood has not overwhelmed the prior. 3. Forms for pD based on normal approximations In Section 2.1 we argued that focused models are essentially non-hierarchical with a likelihood p.y|��/ and prior p.��/. Before considering particular assumptions for these we examine the form
Page 8
hidden
590 D. J. Spiegelhalter, N. G. Best, B. P. Carlin and A. van der Linde of pD under two general conditions: approximately normal likelihoods and negligible prior information. 3.1. pD assuming a normal approximation to the likelihood We may expand D.��/ around E��|y.��/ = �� �� to give, to second order, D.��/ ��� D. ��/ �� + .�� ��� ��/T �� @D @�� �� �� + 1 2 .�� ��� ��/T �� @2D @��2 �� �� .�� ��� ��/ �� (13) = D. ��/ �� ��� 2.�� ��� ��/TL�� �� �� ��� .�� ��� ��/TL�� �� �� .�� ��� ��/ �� (14) where L = log{p.y|��/} and L and L represent first and second derivatives with respect to ��. This corresponds to a normal approximation to the likelihood. Taking expectations of equation (14) with respect to the posterior distribution of �� gives E��|y{D.��/} ��� D. ��/ �� ��� E[tr{.�� ��� ��/TL�� �� �� .�� ��� ��/}] �� = D. ��/ �� ��� E[tr{L�� �� .�� ��� ��/.�� �� ��� ��/T}] �� = D. ��/ �� ��� tr[L�� �� E{.�� ��� ��/.�� �� ��� ��/T}] �� = D. ��/ �� + tr.���L�� �� V/ where V = E{.�� ��� ��/.�� �� ��� ��/T} �� is the posterior covariance matrix of ��, and ���L�� �� is the observed Fisher information evaluated at the posterior mean of ��. Thus pD ��� tr.���L�� �� V/ (15) which can be thought of as a measure of the ratio of the information in the likelihood about the parameters as a fraction of the total information in the likelihood and the prior. We note the parallel with the classical p�� in equation (6). We also note that L�� �� = Q�� �� ��� P�� �� where Q = @2 log{p.��|y/}=@��2 and P = @2 log{p.��/}=@��2, and hence approximation (15) can be written pD ��� tr.���Q�� �� V/ ��� tr.���P�� �� V/: Under approximate posterior normality V ���1 ��� ���Q�� �� and hence pD ��� p ��� tr.���P�� �� V/ (16) where p is the cardinality of ��.
Page 9
hidden
Model Complexity and Fit 591 3.2. pD for approximately normal likelihoods and negligible prior information Consider a focused model in which p.��/ is assumed to be dominated by the likelihood, either becauseofassuminga���flat���priororbyincreasingthesamplesize.Assumethattheapproximation ��|y ��� N. �� �� ���L�� �� / (17) holds, where �� �� = �� �� are the maximum likelihood estimates such that L�� �� = 0 (Bernardo and Smith (1994), section 5.3). From equation (14) D.��/ ��� D. ��/ �� ��� .�� ��� ��/TL�� �� �� .�� ��� ��/ �� ��� D. ��/ �� + ��p 2 (18) since, by approximation (17), ���.�� ��� ��/TL�� �� �� .�� ��� ��/ �� has an approximate ��2-distribution with p degrees of freedom. Rearranging approximation (18) and taking expectations with respect to the posterior distribution of �� reveals that pD = E��|y{D.��/} ��� D. ��/ �� ��� p i.e. pD will be approximately the true number of parameters: this approximation could also be derived by letting P�� �� ��� 0 in approximation (16). This approximate identity is illustrated in Section 8.1. We note in passing that we might use MCMC output to estimate the classical deviance D. ��/ �� of any likelihood-based model by �� D. ��/ �� = E��|y{D.��/} ��� p: (19) Although the maximum likelihood deviance is theoretically the minimum of D over all feasible values of �� D. ��/ �� will generally be very badly estimated by the sample minimum over an MCMC run, and so the estimator given by equation (19) may be preferable. 4. pD for normal likelihoods In this section we illustrate the formal behaviour of pD for normal likelihoods by using exact and approximate identities. However, it is important to keep in mind that in practice such forms are unnecessary for computation and that pD should automatically allow for fixed effects, random effects and unknown precisions. 4.1. The normal linear model We consider the general hierarchical normal model described by Lindley and Smith (1972). Suppose that y ��� N.A1�� C1/ �� ��� N.A2�� C2/ (20) where all matrices and vectors are of appropriate dimension, and C1 and C2 are assumed known and �� is the focus: unknown precisions are considered in Section 4.5. Then the standardized deviance is D.��/ = .y���A1��/TC1 ���1.y���A1��/ and the posterior distribution for �� is normal with
Page 10
hidden
592 D. J. Spiegelhalter, N. G. Best, B. P. Carlin and A. van der Linde mean �� �� = Vb and covariance V: V and b will be left unspecified for the moment. Expressing y ��� A1�� as y ��� A1 �� �� + A1 �� �� ��� A1�� reveals that D.��/ = D. ��/ �� ��� 2.y ��� A1 ��/TC1 �� ���1A1.�� ��� ��/ �� + .�� ��� ��/TA1 �� TC1 ���1A1.�� ��� ��/: �� Taking expectations with respect to the posterior distribution of �� eliminates the middle term and gives �� D = D. ��/ �� + tr.A1 TC1 ���1A1V/ and thus pD = tr.A1 TC1 ���1A1V/: We note that A1 TC1 ���1A1 is the Fisher information ���L V is the posterior covariance matrix and hence pD = tr.���L V/: (21) an exact version of approximation (15). It is also clear that in this context pD is invariant to affine transformations of ��. If �� is assumed known, then Lindley and Smith (1972) showed that V ���1 = A1 TC1 ���1A1 + C2 ���1 and hence from equation (21) pD = p ��� tr.C2 ���1V/ (22) as an exact version of approximation (16) then 0 pD p, and p ��� pD is the measure of the ���shrinkage��� of the posterior estimates towards the prior means. If .C2 ���1V/���1 = A1 TC1 ���1A1C2 +Ip has eigenvalues ��i + 1 i = 1 ::: p, then pD = p ��� i=1 ��i ��i + 1 (23) and hence the upper bound for pD is approached as the eigenvalues of C2 become large, i.e. the prior becomes flat. It can further be shown, in the case A1 = In, that pD is the sum of the squared canonical correlations between data Y and the ���signal��� ��. 4.2. The ���hat��� matrix and leverages A revealing identity is found by noting that b = A1 TC1 ���1 y and the fitted values for the data are given by �� y = A1 �� �� = A1Vb = A1VA1 TC1 ���1 y. Thus the hat matrix that projects the data onto the fitted values is H = A1VA1 TC1 ���1, and pD = tr.A1 TC1 ���1A1V/ = tr.A1VA1 TC1 ���1/ = tr.H/: (24) This identity also holds assuming that �� is unknown with a uniform prior, in which case Lindley and Smith (1972) showed that V ���1 = A1 TC1 ���1A1 + C2 ���1 ��� C2 ���1A2.A2 TC2 ���1A2/���1A2 TC2 ���1. The identification of the effective number of parameters with the trace of the hat matrix is a standard result in linear modelling and has been applied to smoothing (Wahba, 1990) (page 63) and generalized additive models (Hastie and Tibshirani (1990), section 3.5), and is also the conclusion of Hodges and Sargent (2001) in the context of general linear models. The advantage of using the deviance formulation for specifying pD is that all matrix manipulation and asymptotic approximation is avoided: see Section 4.4 for further discussion. Note that tr.H/ is the sum of terms which in regression diagnostics are identified as the individual leverages, the influence of each observation on its fitted value: we shall return to this identity in Section 6.3.
Page 11
hidden
Model Complexity and Fit 593 Ye (1998) considered the independent normal model yi ��� N.��i �����1/ and suggested that the effective number of parameters should be ��i hi, where hi.��/ = @Ey|��. ��i/ �� @��i : (25) theaveragesensitivityofanunspecifiedestimate ��i �� toasmallchangein yi.Thisisageneralization of the trace of the hat matrix discussed above. In the context of the normal linear models, it is straightforward to show that EY|��. ��/ �� = H��, and hence pD = tr.H/ matches Ye���s suggestion for model complexity. Further connections with Ye (1998) are described in Section 7.2. 4.3. Example: Laird���Ware mixed models Laird and Ware (1982) specified the mixed normal model as y ��� N.X�� + Z�� C1/ �� ��� N.0 D/ where the covariance matrices C1 and D are currently assumed known. The random effects are ��, and the fixed effects are ��, and placing a uniform prior on �� we can write this model within the general Lindley���Smith formulation (20) by setting �� = .�� ��/ A1 = .X Z/ �� = 0 and C2 as a block diagonal matrix with ��� in the top left-hand block, D in the bottom right and 0 elsewhere. We have already shown that in these circumstances pD = tr{A1 TC1 ���1A1.A1 TC1 ���1A1 +C2 ���1/���1}, and substituting in the appropriate entries for the Laird���Ware model gives pD = tr.V��V ���1/, where V�� = XTC1 ���1X XTC1 ���1Z ZTC1 ���1 X ZTC1 ���1 Z V = XTC1 ���1X XTC1 ���1Z ZTC1 ���1X ZTC1 ���1Z + D���1 which is the precision of the parameter estimates assuming that D���1 = 0, relative to the precision assuming informative D. 4.4. Frequentist approaches to model complexity: smoothing and normal non-linear models A common model in semiparametric regression is y ��� N.X�� + �� �����1C1/ �� ��� N.0 �����1D/ where �� is a vector of length n of function values of the nonparametric part of an interpolation spline (Wahba, 1990 van der Linde, 1995) and C1 and D are assumed known. Motivated by the need to estimate the unknown scale factors �����1 and �����1, for many years the effective number of parameters has been taken to be the trace of the hat matrix (Wahba (1990), page 63) and so, for example, �����1 �� is the residual sum of squares divided by the ���effective degrees
Page 12
hidden
594 D. J. Spiegelhalter, N. G. Best, B. P. Carlin and A. van der Linde of freedom��� n ��� tr.H/. In this class of models this measure of complexity coincides with pD. Interest in regression diagnostics (Eubank, 1985 Eubank and Gunst, 1986) and cross-validation to determine the smoothing parameter ��=�� (Wahba (1990), section 4.2) also drew attention to the diagonal entries of the hat matrix as leverage values. LinkstopartiallyBayesianinterpolationmodelshavebeenprovidedbyKimeldorfandWahba (1970) and Wahba (1978, 1983) and further work built on these ideas. For example, another large class of models can be formulated by using the following extension to the Lindley���Smith model: y ��� N{g.��/ �����1C1} �� ��� N.A2�� �����1D/ where g is a non-linear expression as found, for example, in pharmacokinetics or neural net- works: in many situations A2�� will be 0 and C1 and D will be identity matrices. Define q.��/ = .y ��� g.��//TC1 ���1.y ��� g.��// r.��/ = .�� ��� A2��/TD���1.�� ��� A2��/ as the likelihood and prior residual variation. MacKay (1992) suggested estimating �� and �� by maximizing the ���type II��� likelihood p.y|�� ��/ derived from integrating out the unknown �� from the likelihood. Setting derivatives equal to 0 eventually reveals that �����1 �� = q. ��/ �� n ��� pD �����1 �� = r. ��/ �� pD which are the fitted likelihood and prior residual variation, divided by the appropriate effective degrees of freedom: pD = tr.H/ is the key quantity. These results were derived by MacKay (1992) in the context of ���regularization��� in complex interpolation models such as neural networks, in which the parameters �� are standardized and assumed to have independent normal priors with mean 0 and precision ��. Then expression (16) may be written pD ��� p ��� �� tr.V/: (26) However, MacKay���s use of approximation (26) requires the evaluation of tr.V/, whereas our pD arises without any additional computation. We would also recommend including �� and �� in the general MCMC estimation procedure, rather than relying on type II maximum likelihood estimates (Ripley (1996), page 167). In this and the smoothing context a fully Bayesian analysis requires prior distributions for �����1 and �����1 to be specified (van der Linde, 2000), and this will both change the complexity of the model and require a choice of estimator of the precisions. We shall now illustrate the form of pD in the restricted situation of unknown �����1. 4.5. Normal models with unknown sampling precision Introducing unknown variances as part of the focus confronts us with the need to choose a form for the plug-in posterior estimates. We may illustrate this issue by extending the general hierarchical normal model (20) to the conjugate normal���gamma model with an unknown scale
Page 13
hidden
Model Complexity and Fit 595 parameter �� in both the likelihood and the prior (Bernardo and Smith (1994), section 5.2.1). Suppose that y ��� N.A1�� �����1C1/ �� ��� N.A2�� �����1C2/ (27) and we focus on .�� ��/. The standardized deviance is D.�� ��/ = �� q.��/ ��� n log.��/, where q.��/ = .y ��� A1��/TC1 ���1.y ��� A1��/ is the residual variation. Then, for a currently unspecified estimator ��, �� pD = E�� ��|y.D|�� ��/ ��� D. �� �� ��/ �� = E��|y[E��|�� y{�� q.��/} ��� n log.��/] ��� {�� �� q. ��/ �� ��� n log.��/} �� = tr.H/ + q. ��/.�� �� �� ��� ��/ �� ��� n{log.��/ ��� log.��/} �� (28) where H = A1 TC1 ���1A1.A1 TC1 ���1A1 + C2 ���1/���1 is the hat matrix which does not depend on ��. Thus the additional uncertain scale parameter adds the second two terms to the complexity of the model. A conjugate prior �� ��� gamma.a b/ leads to a posterior distribution ��|y ��� gamma.a + n=2 b + S=2/, where S = .y ��� A1A2��/T.C1 + A1 TC2A1/���1.y ��� A1A2��/: It remains to choose the estimator �� �� to place in equation (28), and we shall consider two options. Suppose that we parameterize in terms of �� and use �� �� = �� �� = a + n=2 b + S=2 making the second term in equation (28) 0. Now if X ��� gamma.a b/, then E{log.X/} = ��.a/ ��� log.b/ where �� is the digamma function, and so log.��/ = ��.a + n=2/ ��� log.b + S=2/. Hence the term contributing to pD due to the unknown precision is pD ��� tr.H/ = ���n �� a + n 2 ��� log a + n 2 ��� 1 ��� 2a ��� 1 3 2a + n using the approximation ��.x/ ��� log.x/���1=2x���1=12x2. This term will tend to 1+1=3n as prior information becomes negligible and hence will be close to the ���correct��� value of 1 for moderate sample sizes. If we were to parameterize in terms of log.��/ and to use �� �� = exp{log.��/}, the third term in equation (28) is 0 and the second term can be shown to be 1 ��� O.n���1/. Thus for reasonable sample sizes the choice of parameterization of the unknown precision will make little difference to the measure of complexity. However, in Section 7 we shall argue that the log-scale may be more appropriate owing to the better approximation to likelihood normality.
Page 14
hidden
596 D. J. Spiegelhalter, N. G. Best, B. P. Carlin and A. van der Linde 5. Exponential family likelihoods We assume that we have p groups of observations, where each of the ni observations in group i has the same distribution. Following McCullagh and Nelder (1989), we define a one-parameter exponential family for the jth observation in the ith group as log{p.yij|��i ��/} = wi{yij��i ��� b.��i/}=�� + c.yij ��/ (29) where ��i = E.Yij|��i ��/ = b .��i/ V.Yij|��i ��/ = b .��i/��=wi and wi is a constant. If the canonical parameterization �� is the focus of the model, then writing ��i b = E��i|y{b.��i/} we easily obtain that the contribution of the ith group to the effective number of parameters is pDi �� = 2niwi{ ��i b ��� b. ��i/}=��: �� (30) These likelihoods highlight the issue of the lack of invariance of pD to reparameterization, since the mean parameterization �� will give a different complexity pDi. �� This is first explored within simple binomial and Poisson models with conjugate priors, and then exact and approximate forms of pD are examined for generalized linear and generalized linear mixed models. 5.1. Binomial likelihood with conjugate prior In the notation of equation (29), �� = 1 wi = 1 and �� = logit.��/ = log{��=.1 ��� ��/}, and the (unstandardized) deviance is D.��i/ = ���2yi log.��i/ ��� 2.ni ��� yi/ log.1 ��� ��i/ where yi = ��jyij. A conjugate prior ��i = {1 + exp.�����i/}���1 ��� beta.a b/ provides a posterior ��i ��� beta.a + yi b + ni ��� yi/ with mean .a + yi/=.a + b + ni/. Now, if X ��� beta.a b/, then E{log.X/} = ��.a/ ��� ��.a + b/ and E{log.1 ��� X/} = ��.b/ ��� ��.a + b/ where �� is the digamma function, and hence it can be shown that D.��i/ = D.��i/ = ���2yi ��.a + yi/ ��� 2.ni ��� yi/ ��.b + ni ��� yi/ + 2ni ��.a + b + ni/ D. ��i/ �� = ���2yi log.a + yi/ ��� 2.ni ��� yi/ log.b + ni ��� yi/ + 2ni log.a + b + ni/ D. ��i/ �� = ���2yi ��.a + yi/ + 2yi ��.b + ni ��� yi/ + 2ni log[1 + exp{��.a + yi/ ��� ��.b + ni ��� yi/}] D.��i med/ = D.��i med/ = ���2yi log.��i med/ ��� 2.ni ��� yi/ log.1 ��� ��i med/ where ��i med denotes the posterior median of ��i. Exact pDi s are obtainable by subtraction, and Fig. 1 shows how the value of pDi depends on the parameterization, the data and the prior. We may also gain further insight into the behaviour of pDi by considering approximate formulae for the mean and canonical parameterizations by using ��.x/ ��� log.x/ ��� 1=2x ��� log.x ��� 1 2 /. This leads to pDi �� ��� yi a + yi + ni ��� yi b + ni ��� yi ��� ni a + b + ni 7 pDi �� ��� ni a + b + ni ��� 1 2 : (31) We make the following observations.
Page 15
hidden
Model Complexity and Fit 597 Fig. 1. Binomial likelihood���contribution of the ith group to the effective number of parameters under various parameterizations (canonical pDi �� , mean pDi �� and median pDi med) as a function of the data (sample size ni and observed proportion yi=ni) and prior (effective prior sample size a + b and prior mean a=(a + b)): we are seeking agreement between alternative parameterizations with little dependence on data

Readership Statistics

303 Readers on Mendeley
by Discipline
 
 
 
by Academic Status
 
33% Ph.D. Student
 
17% Post Doc
 
11% Researcher (at an Academic Institution)
by Country
 
32% United States
 
12% United Kingdom
 
7% Germany

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in