Incorporating auxiliary information for improved prediction in high-dimensional datasets: An ensemble of shrinkage approaches

Abstract

With advancement in genomic technologies, it is common that two high-dimensional datasets are available, both measuring the same underlying biological phenomenon with different techniques. We consider predicting a continuous outcome Y using X, a set of p markers which is the best available measure of the underlying biological process. This same biological process may also be measured by W, coming from a prior technology but correlated with X. On a moderately sized sample, we have (Y,X,W), and on a larger sample we have (Y,W). We utilize the data on W to boost the prediction of Y by X. When p is large and the subsample containing X is small, this is a p > n situation. When p is small, this is akin to the classical measurement error problem; however, ours is not the typical goal of calibrating W for use in future studies. We propose to shrink the regression coefficients β of Y on X toward different targets that use information derived from W in the larger dataset. We compare these proposals with the classical ridge regression of Y on X, which does not use W. We also unify all of these methods as targeted ridge estimators. Finally, we propose a hybrid estimator which is a linear combination of multiple estimators of β. With an optimal choice of weights, the hybrid estimator balances efficiency and robustness in a data-adaptive way to theoretically yield a smaller prediction error than any of its constituents. The methods, including a fully Bayesian alternative, are evaluated via simulation studies. We also apply them to a gene-expression dataset. mRNA expression measured via quantitative real-time polymerase chain reaction is used to predict survival time in lung cancer patients, with auxiliary information from microarray technology available on a larger sample. © The Author 2012. Published by Oxford University Press. All rights reserved.
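
As an illustration of the "targeted ridge" idea named in the abstract, the sketch below uses the standard closed form for a ridge penalty centered on a target vector: minimizing ||y - Xb||^2 + lam * ||b - target||^2 gives b = (X'X + lam I)^{-1} (X'y + lam * target). This is a minimal sketch, not the authors' implementation. The function names `targeted_ridge` and `hybrid`, the simulated data, the stand-in target, and the fixed equal weights are all assumptions made for illustration; the article derives its targets from W in the larger sample and chooses the hybrid weights in a data-adaptive way, neither of which is reproduced here.

```python
import numpy as np

def targeted_ridge(X, y, lam, target=None):
    """Minimize ||y - X b||^2 + lam * ||b - target||^2 over b (closed form)."""
    p = X.shape[1]
    if target is None:
        target = np.zeros(p)  # a zero target recovers ordinary ridge regression
    A = X.T @ X + lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y + lam * target)

def hybrid(estimates, weights):
    """Linear combination of coefficient vectors; weights are supplied by the caller."""
    return np.asarray(weights) @ np.vstack(estimates)

# Simulated p > n example (hypothetical data, not from the article).
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# Hypothetical stand-in for a target built from auxiliary information.
stand_in_target = beta + 0.1 * rng.normal(size=p)

b_ridge = targeted_ridge(X, y, lam=5.0)                           # shrink toward zero
b_target = targeted_ridge(X, y, lam=5.0, target=stand_in_target)  # shrink toward the target
b_hybrid = hybrid([b_ridge, b_target], [0.5, 0.5])                # fixed, non-optimal weights
```

As `lam` grows, the estimate moves toward the chosen target, so a good target can help when n is small relative to p, while a poor one adds bias. Combining several such estimators is one way to hedge against a poorly chosen target.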

Cite

Boonstra, P. S., Taylor, J. M. G., & Mukherjee, B. (2013). Incorporating auxiliary information for improved prediction in high-dimensional datasets: An ensemble of shrinkage approaches. Biostatistics, 14(2), 259–272. https://doi.org/10.1093/biostatistics/kxs036
