Selecting pseudo-absences for species distribution models: how, where and how many?

by Morgane Barbet-Massin, Frédéric Jiguet, Cécile Hélène Albert, Wilfried Thuiller
Methods in Ecology and Evolution (2012)

Abstract

Summary
1. Species distribution models are increasingly used to address questions in conservation biology, ecology and evolution. The most effective species distribution models require data on both species presence and the available environmental conditions (known as background or pseudo-absence data) in the area. However, there is still no consensus on how and where to sample these pseudo-absences and how many.
2. In this study, we conducted a comprehensive comparative analysis based on simple simulated species distributions to propose guidelines on how, where and how many pseudo-absences should be generated to build reliable species distribution models. Depending on the quantity and quality of the initial presence data (unbiased vs. climatically or spatially biased), we assessed the relative effect of the method for selecting pseudo-absences (random vs. environmentally or spatially stratified) and their number on the predictive accuracy of seven common modelling techniques (regression, classification and machine-learning techniques).
3. When using regression techniques, the method used to select pseudo-absences had the greatest impact on the model’s predictive accuracy. Randomly selected pseudo-absences yielded the most reliable distribution models. Models fitted with a large number of pseudo-absences but equally weighted to the presences (i.e. the weighted sum of presence equals the weighted sum of pseudo-absence) produced the most accurate predicted distributions. For classification and machine-learning techniques, the number of pseudo-absences had the greatest impact on model accuracy, and averaging several runs with fewer pseudo-absences than for regression techniques yielded the most predictive models.
4. Overall, we recommend the use of a large number (e.g. 10 000) of pseudo-absences with equal weighting for presences and absences when using regression techniques (e.g. generalised linear model and generalised additive model); averaging several runs (e.g. 10) with fewer pseudo-absences (e.g. 100) with equal weighting for presences and absences with multiple adaptive regression splines and discriminant analyses; and using the same number of pseudo-absences as available presences (averaging several runs if few pseudo-absences) for classification techniques such as boosted regression trees, classification trees and random forest. In addition, we recommend the random selection of pseudo-absences when using regression techniques and the random selection of geographically and environmentally stratified pseudo-absences when using classification and machine-learning techniques.
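The equal-weighting recommendation in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the grid of background cells, the function names and the choice to give each presence a weight of 1 are all assumptions made for illustration. The sketch draws pseudo-absences at random, scales their weights so that the weighted sum of pseudo-absences equals the weighted sum of presences, and also draws ten replicate sets of 100 pseudo-absences as suggested for runs that are later averaged.

```python
import random


def sample_pseudo_absences(background_cells, n_pseudo, rng):
    """Draw n_pseudo background cells uniformly at random, without replacement."""
    return rng.sample(background_cells, n_pseudo)


def equal_total_weights(n_presences, n_pseudo):
    """Return (presence_weights, pseudo_absence_weights) whose sums are equal.

    Assumption: each presence gets weight 1, so each pseudo-absence
    gets weight n_presences / n_pseudo.
    """
    presence_weights = [1.0] * n_presences
    pseudo_weights = [n_presences / n_pseudo] * n_pseudo
    return presence_weights, pseudo_weights


rng = random.Random(0)

# Hypothetical study area: a 200 x 200 grid of cells (40 000 background cells).
background = [(x, y) for x in range(200) for y in range(200)]
n_presences = 100  # hypothetical number of presence records

# Regression-type setting: one large set of pseudo-absences, down-weighted.
large_set = sample_pseudo_absences(background, 10_000, rng)
w_pres, w_pseudo = equal_total_weights(n_presences, len(large_set))

# Replicated setting: ten smaller sets, one per model run to be averaged.
replicate_sets = [sample_pseudo_absences(background, 100, rng) for _ in range(10)]
```

With 100 presences and 10 000 pseudo-absences, each pseudo-absence carries a weight of 0.01 under this scheme, so both classes contribute the same total weight to the fit.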

Morgane Barbet-Massin1*, Frédéric Jiguet1, Cécile Hélène Albert2,3 and Wilfried Thuiller3
1Muséum National d’Histoire Naturelle, UMR 7204 MNHN-CNRS-UPMC, Centre de Recherches sur la Biologie des
Populations d’Oiseaux, CP 51, 55 Rue Buffon, 75005 Paris, France; 2Department of Biology, McGill University, 1205
Docteur Penfield, Montréal, QC, Canada; and 3Laboratoire d’Ecologie Alpine, UMR-CNRS 5553, Université Joseph
Fourier, Grenoble I, BP 53, 38041 Grenoble Cedex 9, France
Key-words: background data, bias, biomod, ecological niche modelling, sampling design,
virtual species
*Corresponding author. E-mail: barbet@mnhn.fr
Correspondence site: http://www.respond2articles.com/MEE/
Methods in Ecology and Evolution doi: 10.1111/j.2041-210X.2011.00172.x
© 2012 The Authors. Methods in Ecology and Evolution © 2012 British Ecological Society
Introduction
Species distribution models (SDM) are increasingly used to
address numerous questions in conservation biology, ecology
and evolution (Guisan & Thuiller 2005). They have been used
to test biogeographical, ecological and evolutionary hypotheses
(Graham et al. 2004a), to predict species’ invasion and
proliferation (Peterson & Vieglais 2001), to assess the impact of
climate, land use and other environmental changes on species
distributions (Thuiller et al. 2005), to improve surveys for rare
species by identifying sites where the probability of occurrence
is high (Engler, Guisan & Rechsteiner 2004) and to support
conservation planning and reserve selection (Marini et al.
2009).
The SDM widely used in these studies can be categorised into
two groups: methods that only require presence data vs. those
that require both presence and absence data (Brotons et al.
2004). Contrary to popular belief, there are very few
presence-only SDM, the most common being rectilinear envelope (e.g.
BIOCLIM, Busby 1991) and distance-based envelope (e.g.
Mahalanobis distance, Farber & Kadmon 2003). SDM such
as Maxent or GARP, sometimes misleadingly referred to as
presence-only methods, actually do require the use of
background data or pseudo-absence data. As confirmed absences
are very difficult to obtain, especially for mobile species, and
require higher levels of sampling effort to ensure their
reliability compared with presence data (Mackenzie & Royle 2005),
presence-only models have often been used to cope with the
lack of absence data (Graham et al. 2004b). However,
comparisons of various SDM show that presence–absence models
tend to perform better than presence-only models (Elith et al.
2006). Thus, presence–absence models are increasingly used
when only presence data are available, by creating artificial
absence data (usually called pseudo-absences or background
data).
As false absence data can have negative effects on SDM (Gu
& Swihart 2004), different strategies have been proposed to
improve the selection of an appropriate pseudo-absence data
set. Some studies have suggested using pseudo-absence data
selected outside a pre-defined region based on a simple
preliminary model or based on a minimum distance to the presences
(Zaniewski, Lehmann & Overton 2002; Engler, Guisan &
Rechsteiner 2004; Lobo, Jimenez-Valverde & Hortal 2010). If
presences of the studied species have been collected during field
surveys that also considered other species, such that bias in the
sampling design is the same for all species, better results can be
obtained by taking pseudo-absences within the presence points
of these other species (Phillips et al. 2009). To our knowledge,
the influence of the number of pseudo-absences selected has
rarely been investigated. For the Maxent technique, Phillips &
Dudik (2008) found that predictive accuracy was higher with
around 10 000 background pseudo-absences. Nevertheless,
prevalence (defined here as the ratio of the quantity of presence
data to the quantity of absence data used to fit the model) has
been shown to influence model accuracy (McPherson, Jetz &
Rogers 2004). Although very informative, most of these
previous studies used empirical data without knowing the true
distribution of the species, the sampling design or presence data
bias (for discussion on bias and sampling design, see Albert
et al. 2010). Indeed, besides the obvious problems related to
unreliable absence data, the presence data may also be biased
or incomplete, depending on the sampling scheme, accuracy of
the data and species detection probability (Barbet-Massin,
Thuiller & Jiguet 2010). The conclusions of these empirical
studies are therefore harder to generalise and apply than
conclusions from virtual experiments, where results or patterns
can be compared with the known truth (Zurell et al. 2010).
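Two notions from the paragraph above, pseudo-absences restricted to a minimum distance from the presences and prevalence as the ratio of presences to absences, can be sketched as follows. This is an illustrative sketch only: the grid, the presence coordinates, the distance threshold and the function names are hypothetical, and Euclidean distance on grid coordinates is assumed rather than taken from the cited studies.

```python
import math
import random


def is_far_from_presences(cell, presences, min_dist):
    """True if the cell lies at least min_dist from every presence point."""
    return all(math.dist(cell, p) >= min_dist for p in presences)


def distance_filtered_pseudo_absences(candidates, presences, min_dist, n, rng):
    """Randomly pick up to n candidate cells that are far enough from all presences."""
    eligible = [c for c in candidates if is_far_from_presences(c, presences, min_dist)]
    return rng.sample(eligible, min(n, len(eligible)))


def prevalence(n_presences, n_absences):
    """Prevalence as defined in the text: presences divided by absences."""
    return n_presences / n_absences


rng = random.Random(1)

# Hypothetical 50 x 50 grid and three hypothetical presence records.
grid = [(x, y) for x in range(50) for y in range(50)]
presence_points = [(10, 10), (12, 11), (40, 35)]

pseudo_absences = distance_filtered_pseudo_absences(
    grid, presence_points, min_dist=5.0, n=100, rng=rng
)
ratio = prevalence(len(presence_points), len(pseudo_absences))
```

In this toy setting the three presences and 100 pseudo-absences give a prevalence of 0.03; changing the number of pseudo-absences changes that ratio, which is why number and weighting are examined together in the study.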
The goal of this study is to systematically test the effect of
known sources of variability related to the selection of
pseudo-absence data to deliver a comprehensive guideline on how,
where and how many pseudo-absences should be generated to
build unbiased and reliable SDM. Here, we aimed to answer
the following questions:
(a) Which ratio of presences/absences achieves the highest model accuracy?
(b) What is the optimal number of replicate sets of pseudo-absences?
(c) What is the optimal number and weighting scheme of pseudo-absences per replicate?
(d) Which method for generating pseudo-absences results in the most accurate models?
(e) How does bias in the sampling design influence the optimal use of pseudo-absences?
(f) Which parameters (number of pseudo-absences, method of generating pseudo-absences and weighting scheme) have the greatest influence on the models’ predictive accuracy?
For each one of these six questions, we further tested for an
effect of the number of presences available and the choice of
the modelling technique, using seven different SDM. To do so,
we performed a comparative analysis based on virtual data.
We thus knew the species’ true distribution and were able to
simulate different realisations of this distribution that were
either unbiased or purposely biased geographically or
climatically. Geographically biased presence data could arise from
sampling along main roads or railways, or within a subset
of the countries where the species occurs (Kadmon, Farber &
Danin 2004; Albert et al. 2010). Geographical bias matches
some large-scale surveys like the North American Breeding
Bird Survey with sampling sites along the main roads or some
common data sets used for species distribution modelling
which follow political boundaries (e.g. European breeding
birds, Huntley et al. 2008). Climatically biased presence data
can result either from a spatially biased sampling design, that
is, when data from an area with climatically different
characteristics are missing (Barbet-Massin, Thuiller & Jiguet 2010),
or from sampling that was not carried out over the whole
environmental range of a given species, which is often the case
for species ranging from low to very high altitude, because the
latter is usually less thoroughly surveyed.
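The contrast between unbiased and geographically biased realisations of a known distribution can be sketched with a toy example. Everything here is a hypothetical illustration and not the simulation design used in the study: the grid size, the rule defining the occupied range, the position of the "road" and the function names are all assumed.

```python
import random


def occupied_cells(width, height, is_suitable):
    """List the grid cells where the virtual species is truly present."""
    return [(x, y) for x in range(width) for y in range(height) if is_suitable(x, y)]


def unbiased_sample(cells, n, rng):
    """Sample presences uniformly at random over the whole true range."""
    return rng.sample(cells, n)


def road_biased_sample(cells, n, road_y, max_offset, rng):
    """Sample presences only from occupied cells within max_offset rows of a road."""
    near_road = [c for c in cells if abs(c[1] - road_y) <= max_offset]
    return rng.sample(near_road, min(n, len(near_road)))


rng = random.Random(42)

# Hypothetical true range: the species occupies all cells with 20 <= x < 60.
true_range = occupied_cells(100, 100, lambda x, y: 20 <= x < 60)

unbiased = unbiased_sample(true_range, 50, rng)
# Hypothetical road running along row y = 30, surveyed within 2 cells either side.
biased = road_biased_sample(true_range, 50, road_y=30, max_offset=2, rng=rng)
```

Because the true range is known, any model fitted to either sample can be scored against it, which is the advantage of virtual data that the text describes.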
Methods
CREATING VIRTUAL SPECIES
To make sure that our results were not influenced by the choice of a
species and the peculiarities thereof, we created two geographically
distinct virtual species (Fig. S1). To produce the simplest possible