Addressing Classification on Highly Imbalanced Clinical Datasets

Abstract

Over the last twenty years, machine learning has provided a myriad of frameworks and tools to improve data analysis in several fields. Classification, regression, clustering and dimensionality reduction techniques have been widely used in clinical studies to assist health professionals in screening, risk estimation, diagnostics and prognostics. Because prospective studies often require a long follow-up period and a large sample, many investigations rely on retrospective analysis to develop precise classifiers. However, biological data usually have a limited number of samples and an imbalanced class distribution, which degrades classification performance. These issues can be alleviated by employing balancing techniques, which increase the number of samples of the minority classes (oversampling) and/or decrease the number of samples of the majority classes (undersampling). In this work, we propose an original framework to assess several balancing techniques, combining them with three out-of-the-box classifiers. We applied these combinations to the AVOCADO clinical study, a database of patient information that includes cardiovascular death or survival. Our retrospective analysis of this database showed that, when training the algorithm to predict cardiovascular outcomes in both sexes, the best undersampling techniques were ENN, RENN and Near-Miss 3, while ADASYN and SMOTE were the best oversampling techniques. Regarding the classifier algorithms, Random Forest and Logistic Regression (with the internal balancing parameter enabled) achieved the best results with both families of balancing techniques. Proper balancing techniques combined with feature importance analysis improved the identification of clinical patterns in the data, enabling detection of high-risk patients. This approach can be used in personalized medicine to improve patient survival and recovery.
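To make the two families of balancing techniques concrete, the following is a minimal sketch in plain Python of a SMOTE-style oversampler and an ENN-style undersampler. It is not the authors' framework: the function names `smote_like` and `enn_like`, the toy two-dimensional data, the majority-vote removal rule and the choice of `k` are all assumptions made for illustration, and a real study would more likely use a maintained library implementation.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def enn_like(X, y, majority_label, k=3):
    """Drop majority samples whose own label does not win the majority
    vote of their k nearest neighbours (ENN-style, assumed vote rule)."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == majority_label:
            nearest = sorted(
                (j for j in range(len(X)) if j != i),
                key=lambda j: math.dist(xi, X[j]),
            )[:k]
            votes = sum(1 for j in nearest if y[j] == yi)
            if votes * 2 <= k:  # neighbours mostly disagree: remove sample
                continue
        keep_X.append(xi)
        keep_y.append(yi)
    return keep_X, keep_y


# Toy data (hypothetical): label 0 is the majority class, label 1 the minority.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
            (0.2, 0.4), (0.4, 0.2), (2.1, 2.0)]
minority = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.3)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

# Oversampling: add four synthetic minority points.
synthetic = smote_like(minority, n_new=4)

# Undersampling: remove majority points that sit among minority neighbours.
X_clean, y_clean = enn_like(X, y, majority_label=0)
```

In this toy set, the majority point at (2.1, 2.0) lies inside the minority cluster, so the ENN-style pass removes it, while every synthetic point produced by the SMOTE-style pass falls on a segment between two existing minority points. Either step would be applied to the training split only, before fitting a classifier.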

Cite

Fonseca, A. B., Martins-Jr, D. C., Wicik, Z., Postula, M., & Simões, S. N. (2022). Addressing Classification on Highly Imbalanced Clinical Datasets. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13254 LNBI, pp. 103–114). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-17531-2_9
