A clustering-stratified cross-validation framework for validating omics survival models: application to head and neck cancer

Abstract

Background: This study addresses the challenge of developing reliable prognostic models for time-to-event (TTE) outcomes from high-dimensional omics data in head and neck cancers. Resampling methods, particularly nested cross-validation, are considered standard for model hyperparameter selection and performance evaluation. When the data are clustered, balancing the random partition of the cross-validation folds across clusters is a candidate strategy for minimizing optimism bias and instability, and it warrants testing. This work compares the performance of three nested cross-validation implementations: random assignment of the folds, clustering-based resampling, and internal-external validation using a hold-out approach.

Method: We analyzed two head and neck squamous cell carcinoma (HNSCC) cohorts, The Cancer Genome Atlas (TCGA) and SCANDARE (NCT03017573), with clinical data and transcriptomic data normalized as log-transcripts per million. Three model selection methods (Lasso, IPF-Lasso, and Priority-Lasso) were evaluated within five nested cross-validation frameworks: (1) standard nested cross-validation; (2) clustering-based nested cross-validation; (3) nested cross-validation with ComBat correction; (4) nested cross-validation for optimization combined with hold-out for validation; and (5) nested cross-validation for optimization combined with hold-out and ComBat correction for validation. Predictive performance was assessed using the 3-year AUC and the Integrated Brier Score (IBS).

Results: We analyzed data from 581 patients (mean age 61.0 years, 33.6% female) across TCGA-HNSC (n = 505) and SCANDARE (n = 76). Clustering analyses using UMAP and k-means identified three transcriptomic clusters. Among the validation strategies, stratified nested cross-validation (SNCV) showed reduced instability for Lasso (p < 0.001), IPF-Lasso (p < 0.001), and Priority-Lasso (p < 0.001), without apparent optimism in discrimination and calibration metrics, supporting its utility. As an application, we used IPF-Lasso Cox models with SNCV to integrate clinical and transcriptomic data, selecting 35 prognostic variables for head and neck carcinomas. This model achieved a 3-year AUC of 0.71 and an IBS of 0.08.

Conclusion: Clustering-based nested cross-validation combined with stratified cross-validation offers a robust compromise for developing high-dimensional survival models and evaluating their predictive performance. This approach uses clustering-derived stratification to balance dataset heterogeneity across cross-validation folds, although the training and test sets remain derived from the pooled dataset rather than from fully independent cohorts.
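
The fold construction behind a clustering-stratified nested cross-validation can be illustrated with a minimal sketch. It assumes cluster labels are already available (for example, from a UMAP and k-means step) and uses only the Python standard library. The function names `stratified_folds` and `nested_stratified_cv`, the fold counts, and the cluster sizes in the usage lines are illustrative assumptions, not taken from the article, and this is not the authors' implementation. Model fitting and scoring are left out.

```python
import random
from collections import defaultdict


def stratified_folds(indices, labels, n_folds, rng):
    """Split `indices` into n_folds folds, spreading each cluster evenly across folds."""
    by_cluster = defaultdict(list)
    for i in indices:
        by_cluster[labels[i]].append(i)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for cluster in sorted(by_cluster):
        members = by_cluster[cluster]
        rng.shuffle(members)
        # Deal members round-robin; carrying the offset keeps fold sizes balanced
        # instead of always sending leftovers to the first folds.
        for k, i in enumerate(members):
            folds[(offset + k) % n_folds].append(i)
        offset += len(members)
    return folds


def nested_stratified_cv(labels, n_outer=5, n_inner=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits), stratified by cluster at both levels."""
    rng = random.Random(seed)
    all_idx = list(range(len(labels)))
    for test in stratified_folds(all_idx, labels, n_outer, rng):
        test_set = set(test)
        train = [i for i in all_idx if i not in test_set]
        # Inner folds are built from the outer training set only, so the outer
        # test fold never influences hyperparameter selection.
        inner_splits = []
        for val in stratified_folds(train, labels, n_inner, rng):
            val_set = set(val)
            inner_splits.append(([i for i in train if i not in val_set], val))
        yield train, test, inner_splits


# Hypothetical cluster labels for 581 samples (cluster sizes are made up).
example_labels = [0] * 300 + [1] * 200 + [2] * 81
for train, test, inner_splits in nested_stratified_cv(example_labels):
    per_cluster = [sum(1 for i in test if example_labels[i] == c) for c in (0, 1, 2)]
    print(len(train), len(test), per_cluster, len(inner_splits))
```

In this sketch, hyperparameters would be tuned on the inner splits and performance estimated on the outer test fold. Each cluster contributes nearly the same number of samples to every fold, which is the balancing idea the abstract describes.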

Cite

Dubray-Vautrin, A., Choussy, O., Lamy, C., Marret, G., Martin, J., Klijanienko, J., … Mullaert, J. (2025). A clustering-stratified cross-validation framework for validating omics survival models: application to head and neck cancer. BMC Medical Research Methodology, 25(1). https://doi.org/10.1186/s12874-025-02709-9
