DEVELOPMENT OF A PREPROCESSING METHODOLOGY FOR IMBALANCED DATASETS IN MACHINE LEARNING TRAINING

Mykola Zlobin; Volodymyr Bazylevych

Journal Article (Open Access)

Technology Audit and Production Reserves (2025) 3(2) 55-61

DOI: 10.15587/2706-5448.2025.330639

Abstract

The object of the study is an imbalanced dataset of credit card transactions in which fraudulent cases represent only 0.18% of the total. A central problem is that standard machine learning models fail to detect rare fraud events reliably, which often results in high false-negative rates. This occurs because the models focus on the majority class, leading to biased outcomes and undetected fraud. To address this issue, the study applies a structured preprocessing pipeline. It includes scaling of numeric values to remove scale-related bias, stratified sampling to preserve class proportions, random undersampling to balance the dataset, and outlier removal to reduce noise. These steps were applied before training three classification models: logistic regression (LR), K-Nearest Neighbors (KNN), and support vector classifier (SVC). The results show that all models performed well in both cross-validation accuracy and ROC-AUC, with SVC achieving the best ROC-AUC score of 0.9787. The authors attribute this to a pipeline tailored to the characteristics of imbalanced data, in particular the combination of data balancing with careful filtering of noise and redundancy, which makes robust detection of minority class events possible. Compared with similar known preprocessing workflows, the pipeline is reported to provide better class separation, reduced model bias, and improved generalization on unseen data. The results are especially relevant for financial institutions, where fraud detection must be both timely and accurate. The approach offers a practical way to improve security systems without requiring complex or high-cost infrastructure, and it can be adapted to other domains where rare events must be detected in large datasets. In future research, the pipeline can be extended by integrating synthetic sampling techniques such as SMOTE or GANs. Additional experiments with real-time streaming data are expected to further validate the robustness of the proposed methodology.
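The four preprocessing steps named in the abstract can be illustrated with a minimal sketch in standard-library Python. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic data and its `amount` column, the median/IQR scaler, the 1.5 × IQR outlier rule, the 20% test fraction, and all function names. Model training and evaluation (LR, KNN, SVC, ROC-AUC) are left out.

```python
import random
import statistics


def make_synthetic(n=10_000, fraud_rate=0.0018, seed=0):
    """Hypothetical stand-in data: one numeric feature, about 0.18% fraud."""
    rng = random.Random(seed)
    n_fraud = round(n * fraud_rate)
    rows = []
    for i in range(n):
        label = 1 if i < n_fraud else 0
        amount = rng.lognormvariate(3.0, 1.0) * (3.0 if label else 1.0)
        rows.append({"amount": amount, "label": label})
    return rows


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = (q3 - q1) or 1.0  # guard against a zero IQR
    return [(v - median) / iqr for v in values]


def stratified_split(rows, test_fraction, rng):
    """Split rows so each class keeps roughly its original proportion."""
    train, test = [], []
    for label in sorted({r["label"] for r in rows}):
        group = [r for r in rows if r["label"] == label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def random_undersample(rows, rng):
    """Randomly drop rows until every class matches the smallest class."""
    by_label = {}
    for r in rows:
        by_label.setdefault(r["label"], []).append(r)
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


def remove_outliers(rows, key, k=1.5):
    """Drop rows whose `key` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles([r[key] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[key] <= high]


rng = random.Random(42)
rows = make_synthetic()

# Step 1: scale the numeric feature.
for row, scaled in zip(rows, robust_scale([r["amount"] for r in rows])):
    row["amount_scaled"] = scaled

# Step 2: stratified split, so the test set keeps the original imbalance.
train, test = stratified_split(rows, test_fraction=0.2, rng=rng)

# Step 3: balance only the training rows by random undersampling.
balanced = random_undersample(train, rng)

# Step 4: filter outliers from the balanced training rows.
cleaned = remove_outliers(balanced, key="amount_scaled")

print(len(train), len(test), len(balanced), len(cleaned))
```

In this sketch only the training rows are undersampled, so the held-out rows keep the original class ratio and any later evaluation reflects the real imbalance. The scaler is fitted on all rows to follow the order in which the abstract lists the steps; fitting it on the training rows alone would be a stricter variant.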

Citation (APA)

Zlobin, M., & Bazylevych, V. (2025). DEVELOPMENT OF A PREPROCESSING METHODOLOGY FOR IMBALANCED DATASETS IN MACHINE LEARNING TRAINING. Technology Audit and Production Reserves, 3(2), 55–61. https://doi.org/10.15587/2706-5448.2025.330639
