
Generalized linear mixed models: a practical guide for ecology and evolution.

by Benjamin M Bolker, Mollie E Brooks, Connie J Clark, Shane W Geange, John R Poulsen, M Henry H Stevens, Jada-Simone S White
Trends in Ecology & Evolution (2009)

Abstract

How should ecologists and evolutionary biologists analyze nonnormal data that involve random effects? Nonnormal data such as counts or proportions often defy classical statistical procedures. Generalized linear mixed models (GLMMs) provide a more flexible approach for analyzing nonnormal data when random effects are present. The explosion of research on GLMMs in the last decade has generated considerable uncertainty for practitioners in ecology and evolution. Despite the availability of accurate techniques for estimating GLMM parameters in simple cases, complex GLMMs are challenging to fit and statistical inference such as hypothesis testing remains difficult. We review the use (and misuse) of GLMMs in ecology and evolution, discuss estimation and inference and summarize 'best-practice' data analysis procedures for scientists facing this challenge.


Available from www.ncbi.nlm.nih.gov
Review

Benjamin M. Bolker¹, Mollie E. Brooks¹, Connie J. Clark¹, Shane W. Geange², John R. Poulsen¹, M. Henry H. Stevens³ and Jada-Simone S. White¹
¹ Department of Botany and Zoology, University of Florida, PO Box 118525, Gainesville, FL 32611-8525, USA
² School of Biological Sciences, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
³ Department of Botany, Miami University, Oxford, OH 45056, USA
Corresponding author: Bolker, B.M. (bolker@ufl.edu).
0169-5347/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.tree.2008.10.008

Glossary
Bayesian statistics: a statistical framework based on combining data with subjective prior information about parameter values in order to derive posterior probabilities of different models or parameter values.
Bias: inaccuracy of estimation, specifically the expected difference between an estimate and the true value.
Block random effects: effects that apply equally to all individuals within a group (experimental block, species, etc.), leading to a single level of correlation within groups.
Continuous random effects: effects that lead to between-group correlations that vary with distance in space, time or phylogenetic history.
Crossed random effects: multiple random effects that apply independently to an individual, such as temporal and spatial blocks in the same design, where temporal variability acts on all spatial blocks equally.
Exponential family: a family of statistical distributions including the normal, binomial, Poisson, exponential and gamma distributions.
Fixed effects: factors whose levels are experimentally determined or whose interest lies in the specific effects of each level, such as effects of covariates, differences among treatments and interactions.
Frequentist (sampling-based) statistics: a statistical framework based on computing the expected distributions of test statistics in repeated samples of the same system. Conclusions are based on the probabilities of observing extreme events.
Generalized linear models (GLMs): statistical models that assume errors from the exponential family; predicted values are determined by discrete and continuous predictor variables and by the link function (e.g. logistic regression, Poisson regression) (not to be confused with PROC GLM in SAS, which estimates general linear models such as classical ANOVA).
Individual random effects: effects that apply at the level of each individual (i.e. ‘blocks’ of size 1).
Information criteria and information-theoretic statistics: a statistical framework based on computing the expected relative distance of competing models from a hypothetical true model.
Linear mixed models (LMMs): statistical models that assume normally distributed errors and also include both fixed and random effects, such as ANOVA incorporating a random effect.
Link function: a continuous function that defines the response of variables to predictors in a generalized linear model, such as logit and probit links. Applying the link function makes the expected value of the response linear and the expected variances homogeneous.
Markov chain Monte Carlo (MCMC): a Bayesian statistical technique that samples parameters according to a stochastic algorithm that converges on the posterior probability distribution of the parameters, combining information from the likelihood and the prior distributions.
Maximum likelihood (ML): a statistical framework that finds the parameters of a model that maximize the probability of the observed data (the likelihood). (See Restricted maximum likelihood.)
Model selection: any approach to determining the best of a set of candidate statistical models. Information-theoretic tools such as AIC, which also allow model averaging, are generally preferred to older methods such as stepwise regression.
Nested models: models that are subsets of a more complex model, derived by setting one or more parameters of the more complex model to a particular value (often zero).
Nested random effects: multiple random effects that are hierarchically structured, such as species within genus or subsites within sites within regions.
Overdispersion: the occurrence of more variance in the data than predicted by a statistical model.
Pearson residuals: residuals from a model which can be used to detect outliers and nonhomogeneity of variance.
Random effects: factors whose levels are sampled from a larger population, or whose interest lies in the variation among them rather than the specific effects of each level. The parameters of random effects are the standard deviations of variation at a particular level (e.g. among experimental blocks). The precise definitions of ‘fixed’ and ‘random’ are controversial; the status of particular variables depends on experimental design and context [16,53].

Generalized linear mixed models: powerful but challenging tools

Data sets in ecology and evolution (EE) often fall outside the scope of the methods taught in introductory statistics classes. Where basic statistics rely on normally distributed data, EE data are often binary (e.g. presence or absence of a species in a site [1], breeding success [2], infection status of individuals or expression of a genetic disorder [3]), proportions (e.g. sex ratios [4], infection rates [5] or mortality rates within groups) or counts (number of emerging seedlings [6], number of ticks on red grouse chicks [7] or clutch sizes of storks [2]). Where basic statistical methods try to quantify the exact effects of each predictor variable, EE problems often involve random effects, whose purpose is instead to quantify the variation among units. The most familiar types of random effect are the blocks in experiments or observational studies that are replicated across sites or times. Random effects also encompass variation among individuals (when multiple responses are measured per individual, such as survival of multiple offspring or sex ratios of multiple broods), genotypes, species and regions or time periods. Whereas geneticists and evolutionary biologists have long been interested in quantifying the magnitude of variation among genotypes [8–10], ecologists have more recently begun to appreciate the importance of random variation in space and time [11] or among individuals [12]. Theoretical studies emphasize the effects of variability on population dynamics [13,14]. In addition, estimating variability allows biologists to extrapolate statistical results to individuals or populations beyond the study sample.

Researchers faced with nonnormal data often try shortcuts such as transforming data to achieve normality and homogeneity of variance, using nonparametric tests or relying on the robustness of classical ANOVA to nonnormality for balanced designs [15]. They might ignore random effects altogether (thus committing pseudoreplication) or treat them as fixed factors [16]. However, such shortcuts can fail (e.g. count data with many zero values cannot be made normal by transformation). Even when they succeed, they might violate statistical assumptions (even nonparametric tests make assumptions, e.g. of homogeneity of variance across groups) or limit the scope of inference (one cannot extrapolate estimates of fixed effects to new groups).

Instead of shoehorning their data into classical statistical frameworks, researchers should use statistical approaches that match their data. Generalized linear mixed models (GLMMs) combine the properties of two statistical frameworks that are widely used in EE, linear mixed models (which incorporate random effects) and generalized linear models (which handle nonnormal data by using link functions and exponential family [e.g. normal, Poisson or binomial] distributions). GLMMs are the best tool for analyzing nonnormal data that involve random effects: all one has to do, in principle, is specify a distribution, link function and structure of the random effects. For example, in Box 1, we use a GLMM to quantify the magnitude of the genotype–environment interaction in the response of Arabidopsis to herbivory. To do so, we select a Poisson distribution with a logarithmic link (typical for count data) and specify that the total number of fruits per plant and the responses to fertilization and clipping could vary randomly across populations and across genotypes within a population.

However, GLMMs are surprisingly challenging to use even for statisticians. Although several software packages can handle GLMMs (Table 1), few ecologists and evolutionary biologists are aware of the range of options or of the possible pitfalls. In our review of EE papers published since 2005 and found by Google Scholar, 311 out of 537 GLMM analyses (58%) used these tools inappropriately in some way (see online supplementary material). Here we give a broad but practical overview of GLMM procedures.
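
To make the three ingredients named above (distribution, link function, random-effect structure) concrete, the following is a minimal simulation sketch rather than the Box 1 analysis itself. It assumes a simplified version of that model: Poisson counts, a log link, and random intercepts only for populations and for genotypes within populations (the random responses to fertilization and clipping are left out). All numeric values, and the names `simulate_fruit_counts`, `beta_fert` and `beta_clip`, are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_fruit_counts(n_pop=3, n_geno=4, n_plants=10, seed=0):
    """Simulate Poisson counts with a log link and nested random intercepts.

    Returns an integer array with columns:
    population, genotype, fertilized (0/1), clipped (0/1), fruit count.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical fixed effects on the log (link) scale.
    beta0 = np.log(20.0)   # baseline of about 20 fruits per plant
    beta_fert = 0.5        # effect of fertilization
    beta_clip = -0.3       # effect of clipping

    # Hypothetical random-effect standard deviations.
    sd_pop, sd_geno = 0.4, 0.3
    pop_eff = rng.normal(0.0, sd_pop, size=n_pop)
    geno_eff = rng.normal(0.0, sd_geno, size=(n_pop, n_geno))  # nested in population

    rows = []
    for p in range(n_pop):
        for g in range(n_geno):
            for _ in range(n_plants):
                fert = int(rng.integers(0, 2))
                clip = int(rng.integers(0, 2))
                # Linear predictor: fixed effects plus both random intercepts.
                eta = (beta0 + beta_fert * fert + beta_clip * clip
                       + pop_eff[p] + geno_eff[p, g])
                mu = np.exp(eta)  # inverse of the log link
                rows.append((p, g, fert, clip, int(rng.poisson(mu))))
    return np.array(rows)


data = simulate_fruit_counts()
print(data.shape)
print(data[:5])
```

Simulated data of this kind are useful mainly as a check: a fitting routine applied to them should recover values close to the ones used to generate them.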

Whereas GLMMs themselves are uncontroversial, describing how to use them to analyze data necessarily touches on controversial statistical issues such as the debate over null hypothesis testing [17], the validity of stepwise regression [18] and the use of Bayesian statistics [19]. Others have thoroughly discussed these topics (e.g. [17–19]); we acknowledge the difficulty while remaining agnostic. We first discuss the estimation algorithms available for fitting GLMMs to data to find parameter estimates. We then describe the inferential procedures for constructing confidence intervals on parameters, comparing and selecting models and testing hypotheses with GLMMs. Finally, we summarize reasonable ‘best practices’ for using these techniques to answer ecological and evolutionary questions.

Estimation
Estimating the parameters of a statistical model is a key step in most statistical analyses. For GLMMs, these parameters are the fixed-effect parameters (effects of covariates, differences among treatments and interactions: in Box 1, these are the overall fruit set per individual and the effects of fertilization, clipping and their interaction on fruit set) and random-effect parameters (the standard deviations of the random effects: in Box 1, variation in fruit set, fertilization, clipping and interaction effects across genotypes and populations). Many modern statistical tools, including GLMM estimation, fit these parameters by maximum likelihood (ML). For simple analyses where the response variables are normal, all treatments have equal sample sizes (i.e. the design is balanced) and all random effects are nested effects, classical ANOVA methods based on computing differences of sums of squares give the same answers as ML approaches. However, this equivalence breaks down for more complex LMMs or for GLMMs: to find ML estimates, one must integrate likelihoods over all possible values of the random effects ([20,21] Box 2). For GLMMs this calculation is at best slow, and at worst (e.g. for large numbers of random effects) computationally infeasible.
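
The integration over random effects described above can be sketched for the smallest possible case. The example below is an illustration under stated assumptions, not the procedure of any particular package: one group of Poisson counts with a log link and a single normally distributed random intercept. It evaluates the marginal likelihood once by non-adaptive Gauss-Hermite quadrature and once by brute-force integration on a fine grid. The function names, the data `y_obs` and the values of `beta0` and `sigma` are all hypothetical.

```python
import math

import numpy as np
from numpy.polynomial.hermite import hermgauss


def poisson_lik(y, b, beta0):
    """Likelihood of the counts y for each candidate random-intercept value in b."""
    b = np.atleast_1d(np.asarray(b, dtype=float))
    y = np.asarray(y, dtype=float)
    mu = np.exp(beta0 + b)[:, None]                        # inverse log link
    log_fact = np.array([math.lgamma(v + 1.0) for v in y])  # log(y!)
    log_p = y[None, :] * np.log(mu) - mu - log_fact[None, :]
    return np.exp(log_p.sum(axis=1))                        # product over observations


def marginal_lik_ghq(y, beta0, sigma, n_nodes=30):
    """Integrate the random intercept out with Gauss-Hermite quadrature."""
    x, w = hermgauss(n_nodes)           # nodes and weights for weight exp(-x^2)
    b = math.sqrt(2.0) * sigma * x      # change of variables to b ~ N(0, sigma^2)
    return float(np.sum(w * poisson_lik(y, b, beta0)) / math.sqrt(math.pi))


def marginal_lik_grid(y, beta0, sigma, n_grid=20001):
    """Reference value: trapezoid rule on a fine grid spanning +/- 8 sigma."""
    b = np.linspace(-8.0 * sigma, 8.0 * sigma, n_grid)
    dens = np.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    f = poisson_lik(y, b, beta0) * dens
    return float(np.sum((f[1:] + f[:-1]) * np.diff(b)) / 2.0)


y_obs = [3, 5, 2, 4]                    # hypothetical counts from one group
beta0, sigma = math.log(3.0), 0.3       # hypothetical parameter values
ghq = marginal_lik_ghq(y_obs, beta0, sigma)
grid = marginal_lik_grid(y_obs, beta0, sigma)
print(ghq, grid)
```

With one random effect per group the quadrature needs only `n_nodes` likelihood evaluations. With d random effects per group, a full tensor-product grid needs `n_nodes ** d` evaluations, which is one way to see why the calculation becomes infeasible as random effects accumulate.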
Statisticians have proposed various ways to approximate the likelihood to estimate GLMM parameters, including pseudo- and penalized quasilikelihood (PQL [22–24]), Laplace approximations [25] and Gauss-Hermite quadrature (GHQ [26]), as well as Markov chain Monte Carlo (MCMC) algorithms [27] (Table 1). In all of these approaches, one must distinguish between standard ML estimation, which estimates the standard deviations of the random effects assuming that the fixed-effect estimates are precisely correct, and restricted maximum likelihood (REML) estimation, a variant that averages over some of the uncertainty in the fixed-effect parameters [28,29].

Restricted maximum likelihood (REML): an alternative to ML that estimates the random-effect parameters (i.e. standard deviations) averaged over the values of the fixed-effect parameters; REML estimates of standard deviations are generally less biased than corresponding ML estimates.

Trends in Ecology and Evolution Vol.24 No.3
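
The difference between ML and REML is easiest to see in a setting much simpler than a GLMM. The sketch below assumes an ordinary normal linear model with no random effects. In that case the ML estimate of the residual variance is RSS/n, whereas the REML estimate is RSS/(n - p), where p is the number of fixed-effect parameters. The simulation settings (`n`, `p`, `true_var`, the number of replicates) are hypothetical and chosen only to make the bias visible.

```python
import numpy as np


def variance_estimates(X, y):
    """Return (ML, REML) estimates of the residual variance of a linear model."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fixed effects
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    # ML treats beta_hat as exactly correct; REML accounts for the
    # p degrees of freedom used to estimate it.
    return rss / n, rss / (n - p)


rng = np.random.default_rng(1)
n, p, true_var = 12, 4, 1.0                            # hypothetical settings
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

ml_values, reml_values = [], []
for _ in range(5000):
    y = X @ np.ones(p) + rng.normal(scale=np.sqrt(true_var), size=n)
    v_ml, v_reml = variance_estimates(X, y)
    ml_values.append(v_ml)
    reml_values.append(v_reml)

ml_mean = float(np.mean(ml_values))
reml_mean = float(np.mean(reml_values))
print("mean ML estimate:  ", ml_mean)     # expected value is true_var * (n - p) / n
print("mean REML estimate:", reml_mean)   # expected value is true_var
```

The same logic carries over to the standard deviations of random effects in mixed models, although there the correction no longer reduces to a simple change of divisor.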
