From Model Selection to Adaptive Estimation

  • Birgé, L.
  • Massart, P.

Abstract

Many different model selection information criteria can be found in the literature, in various contexts including regression and density estimation. There is a huge amount of literature on this subject, and we shall, in this paper, content ourselves with citing only a few typical references in order to illustrate our presentation. Let us just mention the AIC, $C_p$ or $C_L$, BIC and MDL criteria proposed by Akaike (1973), Mallows (1973), Schwarz (1978), and Rissanen (1978) respectively. These methods propose to select, among a given collection of parametric models, the model which minimizes an empirical loss (typically squared error or minus log-likelihood) plus some penalty term proportional to the dimension of the model. From one criterion to another, the penalty functions differ by factors of $\log n$, where $n$ denotes the number of observations. The reasons for choosing one penalty rather than another come either from information theory, Bayesian asymptotic computations, or approximate evaluations of the risk on specific families of models. Many efforts have been made to understand in what circumstances these criteria allow one to identify the right model asymptotically (see Li (1987) for instance). Much less is known about the performance of the estimators provided by these methods from a nonparametric point of view.

Let us consider the particular context of density estimation in $L_2$, for instance. By a nonparametric point of view, we mean that the unknown density does not necessarily belong to any of the given models, so that the best model should approximately realize the best trade-off between the risk of estimation within the model and the distance of the unknown density to the model. When the models have good approximation properties (following Grenander (1981), such models will be called sieves), an adequate choice of the penalty can produce adaptive estimators, in the sense that they estimate a density of unknown smoothness at the rate one would get if the degree of smoothness were known. Notable results in that direction have been obtained by Barron & Cover (1991), who use the MDL criterion when the models are chosen as ε-nets, and by Polyak & Tsybakov (1990), who select the order of a Fourier expansion via Mallows' $C_p$ for regression. One should also mention the results on penalized spline smoothing by Wahba and various coauthors (see Wahba (1990) for an extensive list of references).

This paper is meant to illustrate, by a few theorems and applications mainly directed towards adaptive estimation in Besov spaces, the power and versatility of the method of penalized minimum contrast estimation on sieves. A more general approach to the theory will be given in the companion paper Barron, Birgé & Massart (1995). We shall here restrict ourselves to linear sieves and the particular contrast which defines projection estimators for density estimation. These restrictions will allow us to make extensive use of a recent and very powerful exponential inequality of Talagrand (1994) on the fluctuations of empirical processes, which greatly simplifies the presentation and proofs. The choice of the penalty derives from the control of the risk on a fixed sieve. In that respect, our approach presents some similarity with the method of structural risk minimization of Vapnik (1982). Minimum contrast estimators on a fixed sieve have been studied in great detail in Birgé & Massart (1994).
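In schematic form (the notation here is ours: $\gamma_n$ denotes the empirical contrast and $\hat s_m$ the minimum contrast estimator on a model $S_m$ of dimension $D_m$), all of the criteria mentioned above select

\[
\hat m \;=\; \operatorname*{arg\,min}_{m \in \mathcal{M}} \bigl\{ \gamma_n(\hat s_m) + \mathrm{pen}(m) \bigr\},
\qquad \mathrm{pen}(m) \;\propto\; \frac{D_m}{n} \ \ \text{or} \ \ \frac{D_m \log n}{n},
\]

the $\log n$ factor being what distinguishes, say, BIC from AIC or Mallows' $C_p$.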
For projection estimators, their results can roughly be summarized as follows: $s$ is an unknown density in $L_2(\mu)$ to be estimated using a projection estimator acting on a linear sieve $S$ of dimension $D$, and the loss function is proportional to the square of the distance induced by the norm. Under reasonable conditions on the structure of the space $S$, one gets a quadratic risk of order $\|s - \pi(s)\|^2 + D/n$, where $\pi(s)$ denotes the projection of $s$ on $S$. This is essentially the classical decomposition between the square of the bias and the variance. The presence of a $D/n$ term corresponding to a $D$-dimensional approximating space is not surprising for those who are familiar with Le Cam's developments about the connections between the dimension (in the metric sense) of a space and the minimax risk on this space; see Le Cam (1973) and (1986, Chapter 16) for further details. Our main purpose in this paper is to show that if we replace the single sieve $S$ by a collection of linear sieves $S_m$, $m \in \mathcal{M}_n$, with respective dimensions $D_m$ and suitable properties, and introduce a penalty function $\mathrm{pen}(m)$ of the form $L(m)D_m/n$, one gets a risk which, up to some multiplicative constant, realizes the best trade-off between $\|s - s_m\|^2$ and $L(m)D_m/n$. Here $s_m$ is the best approximant of $s$ in $S_m$, and $L(m)$ is either uniformly bounded or possibly of order $\log n$ when too many of the sieves share the same dimension $D_m$. Note also that $\mathrm{pen}(m)$ will be allowed to be random. We shall show that some more or less recently introduced methods of adaptive density estimation, like unbiased cross-validation (Rudemo 1982) or the hard thresholding of wavelet empirical coefficients (Donoho,
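In modern terminology, the risk bound announced in this paragraph is an oracle-type inequality; the following display is a schematic rendering in our notation (the multiplicative constant $C$ is left unspecified):

\[
\mathbb{E}\,\|\hat s_{\hat m} - s\|^2 \;\le\; C \inf_{m \in \mathcal{M}_n} \Bigl\{ \|s - s_m\|^2 + \frac{L(m) D_m}{n} \Bigr\}.
\]

As a concrete toy instance, the sketch below carries out penalized projection estimation over regular histogram sieves on $[0,1]$. It is a minimal illustration under our own assumptions: the function name `penalized_histogram`, the penalty constant `c`, and the Beta-distributed sample are ours rather than the authors' construction, and a penalty $cD/n$ with fixed $c$ corresponds to the uniformly bounded $L(m)$ case.

```python
import numpy as np

def penalized_histogram(x, max_dim=None, c=2.0):
    """Penalized projection (histogram) density estimator on [0, 1].

    For the regular histogram sieve S_D with D bins, the projection
    estimator has empirical L2 contrast gamma_n = -D * sum_j phat_j**2;
    we select the D minimizing gamma_n + c * D / n, i.e. pen = c*D/n.
    The constant c = 2.0 is an illustrative choice, not the paper's.
    """
    n = len(x)
    if max_dim is None:
        max_dim = max(1, n // 2)
    best_D, best_crit = 1, np.inf
    for D in range(1, max_dim + 1):
        counts, _ = np.histogram(x, bins=D, range=(0.0, 1.0))
        phat = counts / n                       # empirical bin probabilities
        crit = -D * np.sum(phat ** 2) + c * D / n
        if crit < best_crit:
            best_D, best_crit = D, crit
    counts, edges = np.histogram(x, bins=best_D, range=(0.0, 1.0))
    return best_D, edges, counts / n * best_D   # piecewise-constant density

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=500)  # sample from an "unknown" density on [0, 1]
D, edges, density = penalized_histogram(x)
print(f"selected dimension D = {D}")
```

For piecewise-constant sieves the empirical contrast $\gamma_n(\hat s_D) = \|\hat s_D\|^2 - (2/n)\sum_i \hat s_D(X_i)$ collapses to the closed form $-D\sum_j \hat p_j^2$ used in the loop; increasing `c` biases the selection towards smaller models.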

Citation (APA)
Birgé, L., & Massart, P. (1997). From Model Selection to Adaptive Estimation. In Festschrift for Lucien Le Cam (pp. 55–87). Springer New York. https://doi.org/10.1007/978-1-4612-1880-7_4
