The incremental value of unstructured data via natural language processing in machine learning-based COVID-19 mortality prediction: a comparative study

This article is free to access.

Abstract

Background: Although unstructured data extracted from medical records is often advocated as important for enhancing machine learning models, few studies have evaluated whether it actually does so. A retrospective, head-to-head comparative study was conducted to evaluate machine learning models for in-hospital mortality prediction and to quantify any performance improvement resulting from the inclusion of unstructured data.

Methods: Hospitalizations of patients with a confirmed COVID-19 diagnosis at a tertiary teaching hospital specializing in emergency care were selected (n = 844). For the models with structured data, 21 variables were selected from laboratory tests and patient monitoring. For the hybrid models, an additional 21 clinical assertions (e.g., “has_symptom affirmed dyspnea”) were included. Of the 11 models trained and validated, the six with the best discriminative performance were selected for the testing phase. The most representative variables were evaluated using an explainable artificial intelligence model.

Results: The random forest model performed best: with unstructured data included, it achieved an area under the receiver operating characteristic curve (AUC ROC) of 0.9260, up from 0.9170 with structured data only. The inclusion of unstructured data also improved sensitivity from 0.8108 to 0.8378, while specificity remained at 0.8667. However, these improvements were not statistically significantly different from the results of the models with only structured data.

Conclusion: The study concluded that the inclusion of unstructured data did not increase the predictive power of machine learning models for COVID-19 mortality. It also determined that human involvement is crucial for implementation, specifically for validating natural language processing (NLP) outputs and tailoring the selection of unstructured features, given the inherent challenges in processing such data.

Clinical trial number: Not applicable.
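The comparison reported in the Results (AUC ROC, sensitivity, and specificity for a structured-only model versus a hybrid model, plus a check of whether the difference is significant) can be illustrated with a minimal sketch. The abstract does not state which statistical test the authors used; the paired bootstrap below is one common choice and is an assumption made for illustration. The function names, the 0.5 decision threshold, and the synthetic scores are likewise hypothetical and are not the study's data or code.

```python
import random


def auc(labels, scores):
    """AUC ROC via the Mann-Whitney statistic; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold predict death."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_auc_difference(labels, scores_a, scores_b, n_boot=500, seed=0):
    """Approximate 95% percentile interval for AUC(b) - AUC(a).

    Patients are resampled with replacement, and both models are scored on
    the same resample, so the comparison is paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # skip resamples that contain a single class
            continue
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, b) - auc(y, a))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Synthetic example only: 45 deaths and 105 survivors with made-up scores.
rng = random.Random(42)
labels = [1] * 45 + [0] * 105
structured = [0.25 * y + 0.75 * rng.random() for y in labels]
hybrid = [
    min(1.0, max(0.0, s + 0.05 * (2 * y - 1) * rng.random()))
    for y, s in zip(labels, structured)
]

print("AUC structured:", auc(labels, structured))
print("AUC hybrid:", auc(labels, hybrid))
print("Sens/spec hybrid:", sensitivity_specificity(labels, hybrid))
print("95% interval for AUC gain:",
      bootstrap_auc_difference(labels, structured, hybrid))
```

If the interval for the AUC gain contains zero, the improvement is not distinguishable from resampling noise at roughly the 5% level, which is the kind of outcome the abstract reports for the hybrid models.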

Cite

da Silva, R. P., & Pazin-Filho, A. (2025). The incremental value of unstructured data via natural language processing in machine learning-based COVID-19 mortality prediction: a comparative study. BMC Medical Informatics and Decision Making, 25(1). https://doi.org/10.1186/s12911-025-03178-2
