Generalized linear mixed models: ...
Generalized linear mixed models: a practical guide for ecology and evolution
Benjamin M. Bolker1, Mollie E. Brooks1, Connie J. Clark1, Shane W. Geange2, John R. Poulsen1, M. Henry H. Stevens3 and Jada-Simone S. White1
1 Department of Botany and Zoology, University of Florida, PO Box 118525, Gainesville, FL 32611-8525, USA
2 School of Biological Sciences, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
3 Department of Botany, Miami University, Oxford, OH 45056, USA

How should ecologists and evolutionary biologists analyze nonnormal data that involve random effects? Nonnormal data such as counts or proportions often defy classical statistical procedures. Generalized linear mixed models (GLMMs) provide a more flexible approach for analyzing nonnormal data when random effects are present. The explosion of research on GLMMs in the last decade has generated considerable uncertainty for practitioners in ecology and evolution. Despite the availability of accurate techniques for estimating GLMM parameters in simple cases, complex GLMMs are challenging to fit and statistical inference such as hypothesis testing remains difficult. We review the use (and misuse) of GLMMs in ecology and evolution, discuss estimation and inference and summarize ‘best-practice’ data analysis procedures for scientists facing this challenge.

Generalized linear mixed models: powerful but challenging tools
Data sets in ecology and evolution (EE) often fall outside the scope of the methods taught in introductory statistics classes. Where basic statistics rely on normally distributed data, EE data are often binary (e.g. presence or absence of a species in a site [1], breeding success [2], infection status of individuals or expression of a genetic disorder [3]), proportions (e.g. sex ratios [4], infection rates [5] or mortality rates within groups) or counts (number of emerging seedlings [6], number of ticks on red grouse chicks [7] or clutch sizes of storks [2]).
Where basic statistical methods try to quantify the exact effects of each predictor variable, EE problems often involve random effects, whose purpose is instead to quantify the variation among units. The most familiar types of random effect are the blocks in experiments or observational studies that are replicated across sites or times. Random effects also encompass variation among individuals (when multiple responses are measured per individual, such as survival of multiple offspring or sex ratios of multiple broods), genotypes, species and regions or time periods. Whereas geneticists and evolutionary biologists have long been interested in quantifying the magnitude of variation among genotypes [8–10], ecologists have more recently begun to appreciate the importance of random variation in space and time [11] or among individuals [12]. Theoretical studies emphasize the effects of variability on population dynamics [13,14]. In addition, estimating variability allows biologists to extrapolate statistical results to individuals or populations beyond the study sample.

Researchers faced with nonnormal data often try shortcuts such as transforming data to achieve normality and homogeneity of variance, using nonparametric tests or relying on the robustness of classical ANOVA to nonnormality for balanced designs [15]. They might ignore random effects altogether (thus committing pseudoreplication) or treat them as fixed factors [16]. However, such shortcuts can fail (e.g. count data with many zero values cannot be made normal by transformation). Even when they succeed, they might violate statistical assumptions (even nonparametric tests make assumptions, e.g. of homogeneity of variance across groups) or limit the scope of inference (one cannot extrapolate estimates of fixed effects to new groups).

Instead of shoehorning their data into classical statistical frameworks, researchers should use statistical approaches that match their data.
Generalized linear mixed models (GLMMs) combine the properties of two statistical frameworks that are widely used in EE, linear mixed models (which incorporate random effects) and generalized linear models (which handle nonnormal data by using link functions and exponential family [e.g. normal, Poisson or binomial] distributions). GLMMs are the best tool for analyzing nonnormal data that involve random effects: all one has to do, in principle, is specify a distribution, link function and structure of the random effects. For example, in Box 1, we use a GLMM to quantify the magnitude of the genotype–environment interaction in the response of Arabidopsis to herbivory. To do so, we select a Poisson distribution with a logarithmic link (typical for count data) and specify that the total number of fruits per plant and the responses to fertilization and clipping could vary randomly across populations and across genotypes within a population.

Corresponding author: Bolker, B.M. (bolker@ufl.edu). 0169-5347/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.tree.2008.10.008

However, GLMMs are surprisingly challenging to use even for statisticians. Although several software packages can handle GLMMs (Table 1), few ecologists and evolutionary biologists are aware of the range of options or of the possible pitfalls. In reviewing papers in EE since 2005
found by Google Scholar, 311 out of 537 GLMM analyses (58%) used these tools inappropriately in some way (see online supplementary material). Here we give a broad but practical overview of GLMM procedures.

Whereas GLMMs themselves are uncontroversial, describing how to use them to analyze data necessarily touches on controversial statistical issues such as the debate over null hypothesis testing [17], the validity of stepwise regression [18] and the use of Bayesian statistics [19]. Others have thoroughly discussed these topics (e.g. [17–19]); we acknowledge the difficulty while remaining agnostic. We first discuss the estimation algorithms available for fitting GLMMs to data to find parameter estimates. We then describe the inferential procedures for constructing confidence intervals on parameters, comparing and selecting models and testing hypotheses with GLMMs. Finally, we summarize reasonable ‘best practices’ for using these techniques to answer ecological and evolutionary questions.

Estimation
Estimating the parameters of a statistical model is a key step in most statistical analyses. For GLMMs, these parameters are the fixed-effect parameters (effects of covariates, differences among treatments and interactions: in Box 1, these are the overall fruit set per individual and the effects of fertilization, clipping and their interaction on fruit set) and random-effect parameters (the standard deviations of the random effects: in Box 1, variation in fruit set, fertilization, clipping and interaction effects across genotypes and populations). Many modern statistical tools, including GLMM estimation, fit these parameters by maximum likelihood (ML). For simple analyses where the response variables are normal, all treatments have equal sample sizes (i.e. the design is balanced) and all random effects are nested effects, classical ANOVA methods based on computing differences of sums of squares give the same answers as ML approaches.
However, this equivalence breaks down for more complex LMMs or for GLMMs: to find ML estimates, one must integrate likelihoods over all possible values of the random effects ([20,21]; Box 2). For GLMMs this calculation is at best slow, and at worst (e.g. for large numbers of random effects) computationally infeasible.

Statisticians have proposed various ways to approximate the likelihood to estimate GLMM parameters, including pseudo- and penalized quasilikelihood (PQL [22–24]), Laplace approximations [25] and Gauss-Hermite quadrature (GHQ [26]), as well as Markov chain Monte Carlo (MCMC) algorithms [27] (Table 1). In all of these approaches, one must distinguish between standard ML estimation, which estimates the standard deviations of the random effects assuming that the fixed-effect estimates are precisely correct, and restricted maximum likelihood (REML) estimation, a variant that averages over some of the uncertainty in the fixed-effect parameters [28,29].

Glossary
Bayesian statistics: a statistical framework based on combining data with subjective prior information about parameter values in order to derive posterior probabilities of different models or parameter values.
Bias: inaccuracy of estimation, specifically the expected difference between an estimate and the true value.
Block random effects: effects that apply equally to all individuals within a group (experimental block, species, etc.), leading to a single level of correlation within groups.
Continuous random effects: effects that lead to between-group correlations that vary with distance in space, time or phylogenetic history.
Crossed random effects: multiple random effects that apply independently to an individual, such as temporal and spatial blocks in the same design, where temporal variability acts on all spatial blocks equally.
Exponential family: a family of statistical distributions including the normal, binomial, Poisson, exponential and gamma distributions.
Fixed effects: factors whose levels are experimentally determined or whose interest lies in the specific effects of each level, such as effects of covariates, differences among treatments and interactions.
Frequentist (sampling-based) statistics: a statistical framework based on computing the expected distributions of test statistics in repeated samples of the same system. Conclusions are based on the probabilities of observing extreme events.
Generalized linear models (GLMs): statistical models that assume errors from the exponential family; predicted values are determined by discrete and continuous predictor variables and by the link function (e.g. logistic regression, Poisson regression) (not to be confused with PROC GLM in SAS, which estimates general linear models such as classical ANOVA).
Individual random effects: effects that apply at the level of each individual (i.e. ‘blocks’ of size 1).
Information criteria and information-theoretic statistics: a statistical framework based on computing the expected relative distance of competing models from a hypothetical true model.
Linear mixed models (LMMs): statistical models that assume normally distributed errors and also include both fixed and random effects, such as ANOVA incorporating a random effect.
Link function: a continuous function that defines the response of variables to predictors in a generalized linear model, such as logit and probit links. Applying the link function makes the expected value of the response linear and the expected variances homogeneous.
Markov chain Monte Carlo (MCMC): a Bayesian statistical technique that samples parameters according to a stochastic algorithm that converges on the posterior probability distribution of the parameters, combining information from the likelihood and the prior distributions.
Maximum likelihood (ML): a statistical framework that finds the parameters of a model that maximizes the probability of the observed data (the likelihood).
(See Restricted maximum likelihood.)
Model selection: any approach to determining the best of a set of candidate statistical models. Information-theoretic tools such as AIC, which also allow model averaging, are generally preferred to older methods such as stepwise regression.
Nested models: models that are subsets of a more complex model, derived by setting one or more parameters of the more complex model to a particular value (often zero).
Nested random effects: multiple random effects that are hierarchically structured, such as species within genus or subsites within sites within regions.
Overdispersion: the occurrence of more variance in the data than predicted by a statistical model.
Pearson residuals: residuals from a model which can be used to detect outliers and nonhomogeneity of variance.
Random effects: factors whose levels are sampled from a larger population, or whose interest lies in the variation among them rather than the specific effects of each level. The parameters of random effects are the standard deviations of variation at a particular level (e.g. among experimental blocks). The precise definitions of ‘fixed’ and ‘random’ are controversial; the status of particular variables depends on experimental design and context [16,53].
Restricted maximum likelihood (REML): an alternative to ML that estimates the random-effect parameters (i.e. standard deviations) averaged over the values of the fixed-effect parameters; REML estimates of standard deviations are generally less biased than corresponding ML estimates.

Review Trends in Ecology and Evolution Vol.24 No.3
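
As a rough illustration of why ML estimation for GLMMs requires integrating over the random effects, the sketch below evaluates the marginal log-likelihood of a Poisson model with a log link and one normally distributed random intercept per group. The integral is computed by a brute-force midpoint rule, which is only a stand-in for the Laplace and Gauss-Hermite approximations named in the Estimation section, not an implementation of them. The function names, the toy counts and the grid settings (n_grid, width) are assumptions made for illustration; none of them comes from Box 1 or from any software package.

```python
import math


def poisson_logpmf(y, log_mu):
    """Log probability of count y under a Poisson with mean exp(log_mu)."""
    return y * log_mu - math.exp(log_mu) - math.lgamma(y + 1)


def normal_pdf(b, sd):
    """Density of a N(0, sd^2) random effect at value b."""
    return math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def group_marginal_likelihood(counts, beta0, sd, n_grid=2001, width=8.0):
    """Integrate one group's conditional likelihood over its random intercept.

    Midpoint rule on [-width*sd, width*sd]; hypothetical settings chosen
    for clarity rather than speed.
    """
    lo = -width * sd
    step = 2.0 * width * sd / n_grid
    total = 0.0
    for i in range(n_grid):
        b = lo + (i + 0.5) * step
        # Likelihood of the group's counts given this value of the intercept
        cond_loglik = sum(poisson_logpmf(y, beta0 + b) for y in counts)
        total += math.exp(cond_loglik) * normal_pdf(b, sd) * step
    return total


def marginal_loglik(groups, beta0, sd):
    """Marginal log-likelihood: groups are independent, so their logs add."""
    return sum(math.log(group_marginal_likelihood(g, beta0, sd)) for g in groups)


if __name__ == "__main__":
    # Toy counts for three groups (illustrative values only)
    toy_groups = [[0, 1, 0], [10, 12, 9], [3, 2, 4]]
    beta0 = math.log(41 / 9)  # log of the overall mean count
    for sd in (0.1, 0.5, 1.0, 2.0):
        print(sd, marginal_loglik(toy_groups, beta0, sd))
```

With a single random intercept per group the integral is one-dimensional, so a grid is adequate. Each additional random effect per group adds a dimension to the integral, and crossed random effects prevent the likelihood from factoring by group, which is why approximate methods are used in practice.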