Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records

Abstract

Background and aims The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35–50 years who may benefit from early screening, we developed a prediction model using machine learning and electronic health record (EHR)-derived factors.
Methods We enrolled 3,116 adults aged 35–50 who were at average risk for CRC and underwent colonoscopy between 2017 and 2020 at a single center. Prediction outcomes were (1) CRC and (2) CRC or high-risk polyps. We derived our predictors from EHRs (e.g., demographics, obesity, laboratory values, medications, and zip code-derived factors). We constructed four machine learning-based models using a training set (random sample of 70% of participants): regularized discriminant analysis, random forest, neural network, and gradient boosting decision tree. In the testing set (remaining 30% of participants), we measured predictive performance by comparing C-statistics to a reference model (logistic regression).
Results The study sample was 55.1% female, 32.8% non-white, and included 16 (0.05%) CRC cases and 478 (15.3%) cases of CRC or high-risk polyps. All machine learning models predicted CRC with higher discriminative ability than the reference model [e.g., C-statistics (95% CI); neural network: 0.75 (0.48–1.00) vs. reference: 0.43 (0.18–0.67); P = 0.07]. Furthermore, all machine learning approaches, except for gradient boosting, predicted CRC or high-risk polyps significantly better than the reference model [e.g., C-statistics (95% CI); regularized discriminant analysis: 0.64 (0.59–0.69) vs. reference: 0.55 (0.50–0.59); P<0.0015]. The most important predictive variables in the regularized discriminant analysis model for CRC or high-risk polyps were income per zip code, the colonoscopy indication, and body mass index quartiles.
Discussion Machine learning can predict CRC risk in adults aged 35–50 from EHR data, with better discrimination than a logistic regression reference model. Further development of our model is needed, followed by validation in a primary-care setting, before clinical application.
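
The evaluation design described in the Methods (a random 70%/30% split, then comparison of C-statistics on the testing set) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_70_30` and `c_statistic`, the seed, and the toy risk scores are assumptions made for illustration, and the sketch does not reproduce the confidence intervals or P values reported above. The C-statistic is computed here directly from its definition, as the fraction of (case, non-case) pairs in which the case receives the higher predicted risk, with ties counted as one half.

```python
import random
from typing import List, Sequence, Tuple


def split_70_30(n: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly assign indices 0..n-1 to a 70% training and 30% testing set.

    Hypothetical helper; the seed and rounding rule are assumptions.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(0.7 * n))
    return indices[:cut], indices[cut:]


def c_statistic(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """C-statistic (area under the ROC curve) for a binary outcome.

    Counts, over all (case, non-case) pairs, how often the case has the
    higher predicted risk; tied scores contribute 0.5.
    """
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    non_cases = [s for y, s in zip(y_true, scores) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case_score in cases:
        for other_score in non_cases:
            if case_score > other_score:
                concordant += 1.0
            elif case_score == other_score:
                concordant += 0.5
    return concordant / (len(cases) * len(non_cases))


# Split sizes for a cohort of 3,116 participants.
train_idx, test_idx = split_70_30(3116)

# Toy testing-set outcomes and predicted risks (made-up numbers).
toy_outcomes = [0, 0, 1, 1]
toy_risks = [0.10, 0.40, 0.35, 0.80]
toy_c = c_statistic(toy_outcomes, toy_risks)
```

A C-statistic of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of participants, which is acceptable for a testing set of this size; a rank-based formulation would scale better.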

Cite

Hussan, H., Zhao, J., Badu-Tawiah, A. K., Stanich, P., Tabung, F., Gray, D., … Clinton, S. K. (2022). Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records. PLoS ONE, 17(3). https://doi.org/10.1371/journal.pone.0265209
