Presence-only data, strictly speaking, are observations of the locations where animals or plants are found. By recording selected covariates that characterize those locations, we can identify the domain of habitat for a species. Indeed, the distribution of covariates associated with these location data can be used to define an "envelope" of sites where an organism has been located (Pearce and Boyce 2006). Such an envelope might be mapped to show where an organism could occur, presuming that we have recorded an ecologically relevant assemblage of covariates that describe the realized niche.

In practice, researchers attempt to contrast the distribution of covariates associated with a sample of an organism's recorded locations with that of a sample of random landscape locations. Doing so allows estimation of a Resource Selection Function (RSF), which can be used to predict which resource units (e.g., pixels) are likely to be selected by an animal. To force the data to fit the framework of logistic regression, a popular statistical method, the random landscape locations are presumed to be absences; these are sometimes called pseudo-absences. In such a scheme, resource units with recorded locations are assigned 1's and those drawn as random landscape locations are assigned 0's. If these were true used and unused (= presence and absence) data, logistic regression could be used to estimate the probability of occupancy (or use) based on measurements of the set of covariates (MacKenzie et al. 2006). In reality, however, the used locations are always a subset of the landscape from which the random locations are drawn. In other words, the data reflect a sample of used resource units drawn from a larger pool of resource units that were available (a use/available design), and they are being forced into a logistic regression framework for statistical convenience. Unfortunately, the underlying premise of two mutually exclusive categories in a logistic regression classification is not correct.
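To make the use/available design concrete, here is a minimal sketch in Python on simulated data. Nothing in it comes from the original article: the covariate name `elevation`, the sample sizes, and the selection strength are all hypothetical, and use is assumed to be proportional to an exponential function of a single covariate. It shows how the 1's and 0's are assembled and how many of the 0's are in fact used units.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical landscape: 5000 resource units (e.g., pixels), one standardized covariate.
n_units = 5000
elevation = rng.normal(size=n_units)

# Assumption for illustration: relative probability of use is exp(strength * covariate).
selection_strength = 1.0
weights = np.exp(selection_strength * elevation)
used_idx = rng.choice(n_units, size=500, replace=False, p=weights / weights.sum())

# "Available" sample: random landscape locations drawn without regard to use.
available_idx = rng.choice(n_units, size=2000, replace=False)

# Use/available coding: recorded locations get 1, random landscape locations get 0.
covariate = np.concatenate([elevation[used_idx], elevation[available_idx]])
response = np.concatenate([np.ones(len(used_idx)), np.zeros(len(available_idx))])

# The 0's are not true absences: some random locations are units the animal used.
contaminated = np.isin(available_idx, used_idx)
print("random locations that are actually used units:", int(contaminated.sum()))
print("mean covariate, used:     ", round(float(elevation[used_idx].mean()), 2))
print("mean covariate, available:", round(float(elevation[available_idx].mean()), 2))
```

In this simulation a noticeable share of the pseudo-absences coincide with used units, which is the reason the two classes cannot be treated as mutually exclusive.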
The problem lies with the 0's. If the organism is located, say, within a pixel, we can be sure that the location is correctly classified as used. But the random landscape locations where the organism was not observed might appear unused only because of detection bias (MacKenzie et al. 2006) or because sampling intensity was insufficient (Boyce et al. 2002). We therefore have an asymmetry of errors, with the 0's harbouring much greater uncertainty than the 1's. Instead of trying to force the data into an inappropriate statistical framework, it makes much more sense to recognize that the distribution of used locations comes directly from the distribution of random landscape locations. Indeed, this is precisely the statistical framework for applications of quantitative genetics to model natural selection (Manly 1985) that ultimately led to the development of RSFs by Lyman McDonald and Bryan Manly (Manly et al. 1993).

Seber (1984) developed the relevant statistical framework for this use/available design by identifying a logistic discriminant function that can be used to contrast a distribution of used resource units with those available. The RSF is the model that can be used to identify used resource units given a distribution of available resource units. If both distributions are normal, this selection function is an exponential model (Seber 1984). We can estimate coefficients for this RSF using software for logistic regression, essentially tricking the logistic regression MLE algorithm into estimating the logistic discriminant function (Johnson et al. 2006). The predictive capability of this RSF can be evaluated using k-fold cross-validation (Boyce et al. 2002), thereby overcoming the inappropriate application of the
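The estimation and validation steps described above can be sketched roughly as follows. This is an illustration on simulated data, not the procedure of any of the cited papers: the helper names (`fit_logistic`, `kfold_rsf`), the single covariate, and the sample sizes are hypothetical. The logistic fit is a plain Newton-Raphson maximum-likelihood routine; only its slope is kept, giving an exponential RSF w(x) = exp(b * x). The cross-validation is a simplified take on the idea in Boyce et al. (2002): only used points are withheld, bins are equal-area (quantiles of the available scores), and the Spearman rank correlation between bin rank and the count of withheld used points is reported per fold.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson maximum-likelihood fit; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def average_ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    v = np.asarray(v, dtype=float)
    ranks = np.empty(len(v))
    ranks[np.argsort(v, kind="mergesort")] = np.arange(1, len(v) + 1)
    for val in np.unique(v):
        tied = v == val
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman rank correlation computed as the Pearson correlation of ranks."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]

def kfold_rsf(x_used, x_avail, k=5, n_bins=10, seed=0):
    """Per-fold Spearman correlation between RSF bin rank and withheld used counts."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x_used)) % k  # balanced random fold labels
    scores = []
    for f in range(k):
        train, test = x_used[folds != f], x_used[folds == f]
        x = np.concatenate([train, x_avail])
        y = np.concatenate([np.ones(len(train)), np.zeros(len(x_avail))])
        slope = fit_logistic(np.column_stack([np.ones(len(x)), x]), y)[1]
        # Exponential RSF: the intercept is dropped, only the slope is used.
        w_avail, w_test = np.exp(slope * x_avail), np.exp(slope * test)
        # Equal-area bins: interior quantiles of the available RSF scores.
        edges = np.quantile(w_avail, np.linspace(0, 1, n_bins + 1))[1:-1]
        counts = np.bincount(np.digitize(w_test, edges), minlength=n_bins)
        scores.append(spearman(np.arange(n_bins), counts))
    return np.array(scores)

# Simulated example (all values hypothetical): use is proportional to exp(1.0 * x).
rng = np.random.default_rng(42)
landscape = rng.normal(size=20000)
p_use = np.exp(1.0 * landscape)
p_use /= p_use.sum()
x_used = landscape[rng.choice(len(landscape), size=2000, replace=True, p=p_use)]
x_avail = landscape[rng.choice(len(landscape), size=5000, replace=False)]

x_all = np.concatenate([x_used, x_avail])
y_all = np.concatenate([np.ones(len(x_used)), np.zeros(len(x_avail))])
beta_hat = fit_logistic(np.column_stack([np.ones(len(x_all)), x_all]), y_all)
cv_scores = kfold_rsf(x_used, x_avail)

print("estimated RSF coefficient:", round(float(beta_hat[1]), 3))
print("mean Spearman correlation across folds:", round(float(cv_scores.mean()), 3))
```

In this setup the fitted intercept depends only on the ratio of used to available sample sizes, which is why it is discarded and the RSF is read as a relative, not absolute, measure of selection. A rank correlation near 1 indicates that withheld used locations fall increasingly often in higher-scoring bins.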
CITATION STYLE
Boyce, M. S. (2010). Presence-only data, pseudo-absences, and other lies about habitat selection. Ideas in Ecology and Evolution. https://doi.org/10.4033/iee.2010.3.6.c