Bias vs Variance Decomposition for Regression and Classification

Geurts, P.

Abstract

In this chapter, the important concepts of bias and variance are introduced. After an intuitive introduction to the bias/variance tradeoff, we discuss the bias/variance decompositions of the mean squared error (in the context of regression problems) and of the mean misclassification error (in the context of classification problems). Then, we carry out a small empirical study providing some insight into how the parameters of a learning algorithm influence bias and variance.

The general problem of supervised learning is often formulated as an optimization problem: an error measure is defined that evaluates the quality of a model, and the goal of learning is to find, in a family of models (the hypothesis space), a model that minimizes this error estimated on the learning sample (or dataset) S. At first sight, then, if no good enough model is found in this family, it should be sufficient to extend the family or to exchange it for a more powerful one in terms of model flexibility. However, we are usually interested in a model that generalizes well to unseen data rather than in a model that perfectly predicts the outputs for the learning sample cases. Unfortunately, in practice, good results on the learning set do not necessarily imply good generalization performance on unseen data, especially if the "size" of the hypothesis space is large in comparison to the sample size.

Let us use a simple one-dimensional regression problem to explain intuitively why larger hypothesis spaces do not necessarily lead to better models. In this synthetic problem, outputs are generated according to y = f_b(x) + ε, where f_b is represented by the dashed curves in Figure 39.1 and ε is distributed according to a Gaussian N(0, σ). With squared error loss, we will see below that the best possible model for this problem is f_b, and its average squared error is σ². Let us consider two extreme situations of a bad model structure choice.

• A too simple model: using a linear model y = w·x + b and minimizing the squared error on the learning set, we obtain the estimations given in the left part of Figure 39.1 for two different learning samples; a small simulation sketch of this setting is given below.
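To make the decomposition concrete, here is a minimal Python/NumPy sketch (not taken from the chapter) that estimates the noise, squared-bias, and variance terms by Monte Carlo: it repeatedly draws learning samples from a synthetic problem of the form y = f_b(x) + ε, fits the "too simple" linear model to each, and averages the results over a fixed test grid. The choice of f_b as a sine, the noise level SIGMA, the sample size N, and the number of repetitions are all illustrative assumptions, not values from the chapter.

    # Minimal Monte Carlo bias/variance decomposition sketch.
    # Assumptions (not from the chapter): f_b(x) = sin(2*pi*x),
    # SIGMA = 0.3, learning samples of size N = 30, 200 repetitions.
    import numpy as np

    rng = np.random.default_rng(0)

    def f_b(x):
        # The "best" model (Bayes predictor) for squared error loss.
        return np.sin(2 * np.pi * x)

    SIGMA, N, N_SAMPLES = 0.3, 30, 200
    x_test = np.linspace(0.0, 1.0, 50)          # fixed test inputs

    preds = np.empty((N_SAMPLES, x_test.size))
    for s in range(N_SAMPLES):
        # Draw one learning sample S of size N: y = f_b(x) + eps.
        x = rng.uniform(0.0, 1.0, N)
        y = f_b(x) + rng.normal(0.0, SIGMA, N)
        # "Too simple" hypothesis space: linear model y = w*x + b,
        # fitted by least squares.
        w, b = np.polyfit(x, y, deg=1)
        preds[s] = w * x_test + b

    avg_pred = preds.mean(axis=0)               # average model over samples
    bias2    = ((avg_pred - f_b(x_test)) ** 2).mean()
    variance = preds.var(axis=0).mean()

    # Expected squared error = noise + bias^2 + variance.
    print(f"noise    = {SIGMA**2:.4f}")
    print(f"bias^2   = {bias2:.4f}")
    print(f"variance = {variance:.4f}")
    print(f"total    = {SIGMA**2 + bias2 + variance:.4f}")

Run as-is, the squared-bias term should dominate the variance term, which is the signature of an overly simple hypothesis space; replacing the linear fit with a high-degree polynomial (e.g. deg=15) should reverse the picture, with variance dominating instead.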

Citation

Geurts, P. (2009). Bias vs Variance Decomposition for Regression and Classification. In Data Mining and Knowledge Discovery Handbook (pp. 733–746). Springer US. https://doi.org/10.1007/978-0-387-09823-4_37
