The development and validation of prognostic models for overall survival in the presence of missing data in the training dataset: a strategy with a detailed example

Kara-Louise Royle; David A. Cairns

Journal ArticleOPEN ACCESS

The development and validation of prognostic models for overall survival in the presence of missing data in the training dataset: a strategy with a detailed example

Royle K
Cairns D

Diagnostic and Prognostic Research (2021) 5(1)

DOI: 10.1186/s41512-021-00103-9

N/ACitations

14Readers

Abstract

The United Kingdom Myeloma Research Alliance (UK-MRA) Myeloma Risk Profile is a prognostic model for overall survival. It was trained and tested on clinical trial data, aiming to improve the stratification of transplant ineligible (TNE) patients with newly diagnosed multiple myeloma. Missing data is a common problem which affects the development and validation of prognostic models, where decisions on how to address missingness have implications on the choice of methodology. Model building The training and test datasets were the TNE pathways from two large randomised multicentre, phase III clinical trials. Potential prognostic factors were identified by expert opinion. Missing data in the training dataset was imputed using multiple imputation by chained equations. Univariate analysis fitted Cox proportional hazards models in each imputed dataset with the estimates combined by Rubin’s rules. Multivariable analysis applied penalised Cox regression models, with a fixed penalty term across the imputed datasets. The estimates from each imputed dataset and bootstrap standard errors were combined by Rubin’s rules to define the prognostic model. Model assessment Calibration was assessed by visualising the observed and predicted probabilities across the imputed datasets. Discrimination was assessed by combining the prognostic separation D-statistic from each imputed dataset by Rubin’s rules. Model validation The D-statistic was applied in a bootstrap internal validation process in the training dataset and an external validation process in the test dataset, where acceptable performance was pre-specified. Development of risk groups Risk groups were defined using the tertiles of the combined prognostic index, obtained by combining the prognostic index from each imputed dataset by Rubin’s rules. The training dataset included 1852 patients, 1268 (68.47%) with complete case data. Ten imputed datasets were generated. Five hundred twenty patients were included in the test dataset. The D-statistic for the prognostic model was 0.840 (95% CI 0.716–0.964) in the training dataset and 0.654 (95% CI 0.497–0.811) in the test dataset and the corrected D-Statistic was 0.801. The decision to impute missing covariate data in the training dataset influenced the methods implemented to train and test the model. To extend current literature and aid future researchers, we have presented a detailed example of one approach. Whilst our example is not without limitations, a benefit is that all of the patient information available in the training dataset was utilised to develop the model. Both trials were registered; Myeloma IX- ISRCTN68454111 , registered 21 September 2000. Myeloma XI- ISRCTN49407852 , registered 24 June 2009.

Cite

CITATION STYLE

APA

Royle, K.-L., & Cairns, D. A. (2021). The development and validation of prognostic models for overall survival in the presence of missing data in the training dataset: a strategy with a detailed example. Diagnostic and Prognostic Research, 5(1). https://doi.org/10.1186/s41512-021-00103-9

The development and validation of prognostic models for overall survival in the presence of missing data in the training dataset: a strategy with a detailed example

Abstract

Cite

Register to see more suggestions