Comparison of Logistic Regression, Random Forest, Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) Algorithms in Diabetes Prediction

M. Fadli Kurniawan; Dyah Ayu Megawaty

Journal Article, Open Access

Journal of Applied Informatics and Computing (2025) 9(5) 2154-2162

DOI: 10.30871/jaic.v9i5.9815

Abstract

Diabetes mellitus is a prevalent chronic illness whose incidence continues to grow worldwide, placing significant strain on healthcare systems. Timely prediction of diabetes is crucial for early intervention and management. This study compares the effectiveness of four machine learning algorithms, Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), in identifying diabetes cases using a large public dataset of 100,000 patient records obtained from Kaggle. The dataset includes nine clinical variables, among them age, gender, body mass index (BMI), blood glucose level, and HbA1c level. Fewer than 10% of the records were positive (diabetic) cases, so to address this class imbalance the Synthetic Minority Oversampling Technique (SMOTE) was applied exclusively to the training data after an 80:20 stratified split. All models were evaluated using 5-fold stratified cross-validation, with performance measured by accuracy, precision, recall, F1-score, area under the ROC curve (AUC), and training time. Random Forest achieved the highest classification accuracy (96.88%) and AUC (99.70%), indicating superior overall performance. McNemar's tests further showed that the differences in performance between Random Forest and the other models were statistically significant. Feature importance analysis identified HbA1c, glucose level, and BMI as the most influential predictors. These results indicate that Random Forest offers the most balanced combination of accuracy, interpretability, and robustness, making it highly suitable for real-world clinical screening scenarios where early detection of diabetes is critical.
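The McNemar comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which variant of the test the authors used; the exact binomial form below is an assumption, the function name `mcnemar_exact` is hypothetical, and the toy labels are made up for illustration and are not the study's data. The test looks only at the samples where the two classifiers disagree and asks whether those disagreements split evenly.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions (illustrative sketch).

    Returns (b, c, p_value), where b counts samples only model A classified
    correctly and c counts samples only model B classified correctly.
    """
    b = c = 0
    for truth, a, other in zip(y_true, pred_a, pred_b):
        a_ok, other_ok = (a == truth), (other == truth)
        if a_ok and not other_ok:
            b += 1
        elif other_ok and not a_ok:
            c += 1

    n = b + c
    if n == 0:
        # No discordant pairs: the test has no evidence of a difference.
        return b, c, 1.0

    # Under H0 the discordant pairs follow Binomial(n, 0.5);
    # the two-sided p-value doubles the smaller tail, capped at 1.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (assumed, not from the paper): model A is right on all ten
# samples, model B is wrong on the first eight.
y_true = [1, 0] * 5
pred_a = list(y_true)
pred_b = [1 - y for y in y_true[:8]] + y_true[8:]

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"b={b}, c={c}, p={p_value:.4f}")
```

Because the test is paired, both models must be scored on the same held-out samples; comparing accuracies measured on different splits would not be a valid input to it.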

Cite

Kurniawan, M. F., & Megawaty, D. A. (2025). Comparison of Logistic Regression, Random Forest, Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) Algorithms in Diabetes Prediction. Journal of Applied Informatics and Computing, 9(5), 2154–2162. https://doi.org/10.30871/jaic.v9i5.9815
