Multi-Level Training and Testing of CNN Models in Diagnosing Multi-Center COVID-19 and Pneumonia X-ray Images

Abstract

Featured Application: Despite their reported high accuracy, a significant limitation of current AI-assisted COVID-19 diagnostic models is that they are often trained on datasets that come from specific clinics or contain a limited number of training images. This raises an important question: will these models maintain high accuracy when deployed in other clinics, where images may differ? If accuracy does drop, how large a decline should be expected? Conversely, how much can accuracy be improved by augmenting the training dataset with new images? In this study, we evaluated the performance of four CNN models that were trained on incrementally augmented datasets and subsequently tested on images with decreasing similarity to the training data. Through multi-level testing, we assessed the models' capacities for verification, interpolation, and extrapolation in diagnosing COVID-19 and pneumonia from multi-center X-ray images. Compared to conventional one-round training, multi-round training offers more comprehensive insight into a model's learnability, robustness, and interpretability.

This study aimed to address three questions in AI-assisted COVID-19 diagnostic systems: (1) How does a CNN model trained on one dataset perform on test datasets from disparate medical centers? (2) What accuracy gains can be achieved by enriching the training dataset with new images? (3) How can learned features elucidate classification results, and how do they vary among different models? To achieve these aims, four CNN models (AlexNet, ResNet-50, MobileNet, and VGG-19) were trained in five rounds by incrementally adding new images to a baseline training set comprising 11,538 chest X-ray images. In each round, the models were tested on four datasets with decreasing levels of image similarity. Notably, all models showed performance drops when tested on datasets containing outlier images or sourced from other clinics.

In Round 1, the models achieved 95.2–99.2% accuracy on the Level 1 testing dataset (i.e., from the same clinic but set apart for testing only) and 94.7–98.3% on Level 2 (i.e., from an external clinic but similar). However, model performance decreased drastically for Level 3 (i.e., outlier images with rotation or deformation), with the mean sensitivity plummeting from 99% to 36%. For the Level 4 testing dataset (i.e., from another clinic), accuracy decreased from 97% to 86%, and sensitivity from 99% to 67%. In Rounds 2 and 3, adding 25% and 50% of the outlier images to the training dataset improved the average Level 3 accuracy by 15% and 23% (i.e., from 56% to 71% to 83%). In Rounds 4 and 5, adding 25% and 50% of the external images increased the average Level 4 accuracy from 81% to 92% and 95%, respectively. Among the models, ResNet-50 demonstrated the most robust performance across the five-round training/testing phases, while VGG-19 persistently underperformed. Heatmaps and intermediate activation features showed visual correlations to COVID-19 and pneumonia X-ray manifestations but were insufficient to explicitly explain the classification. However, comparing heatmaps and activation features across rounds shed light on how the models' learning behavior progressed.
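The five-round protocol and the two reported metrics can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' pipeline: the function names, the `ROUNDS` table, and the fixed seed are hypothetical, and the sketch assumes that Rounds 4 and 5 keep the outlier images added in Round 3, which the abstract does not state explicitly.

```python
import math
import random

# Hypothetical schedule inferred from the abstract: the fraction of the
# outlier pool and of the external-clinic pool added to the baseline set.
# Assumption: Rounds 4 and 5 retain the 50% of outliers added by Round 3.
ROUNDS = {
    1: {"outlier": 0.00, "external": 0.00},
    2: {"outlier": 0.25, "external": 0.00},
    3: {"outlier": 0.50, "external": 0.00},
    4: {"outlier": 0.50, "external": 0.25},
    5: {"outlier": 0.50, "external": 0.50},
}


def build_training_set(baseline, outlier_pool, external_pool, round_id, seed=0):
    """Return the training items for one round.

    Each pool is shuffled with the same seed on every call, so the items
    added in a later round always include those added in an earlier round.
    """
    rng = random.Random(seed)
    outliers = list(outlier_pool)
    externals = list(external_pool)
    rng.shuffle(outliers)
    rng.shuffle(externals)
    plan = ROUNDS[round_id]
    n_out = int(len(outliers) * plan["outlier"])
    n_ext = int(len(externals) * plan["external"])
    return list(baseline) + outliers[:n_out] + externals[:n_ext]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred, positive):
    """True-positive rate for one class: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp + fn == 0:
        return math.nan  # no positive samples in this test set
    return tp / (tp + fn)
```

In such a setup, each round would retrain every model on the output of `build_training_set` and then score it on each of the four test levels with `accuracy` and `sensitivity`. Reporting sensitivity alongside accuracy matters here because, as the Level 3 results show, sensitivity can collapse while overall accuracy declines much less.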

Cite

Talaat, M., Si, X., & Xi, J. (2023). Multi-Level Training and Testing of CNN Models in Diagnosing Multi-Center COVID-19 and Pneumonia X-ray Images. Applied Sciences (Switzerland), 13(18). https://doi.org/10.3390/app131810270
