Machine learning (ML) can be accurate and reliable on supervised problems such as classification when training is performed appropriately for the predefined classes. In real-world scenarios, class imbalance may arise during dataset creation: one class has a very large number of instances while the other has very few. In other words, the class distribution is not equal. Such scenarios result in anomalous predictions. Handling imbalanced datasets is therefore required so that predictions account for all classes in an equal ratio. The paper surveys various external and internal techniques for balancing datasets found in the literature and presents an experimental analysis of four datasets from different domains: medical, mining, security, and finance. The experiments are done in Python. Four external balancing techniques are used to balance the datasets: two oversampling techniques (SMOTE and ADASYN) and two undersampling techniques (Random Undersampling and Near Miss). The datasets are used for a binary classification task. Three machine learning classification algorithms (logistic regression, random forest, and decision tree) are applied to the imbalanced and balanced datasets to compare and contrast their performance. Comparisons between the balanced and unbalanced datasets are reported. It has been found that undersampling loses many important data points and therefore predicts with low accuracy. For all the datasets, the oversampling technique ADASYN is observed to make reasonably good predictions with appropriate balance.
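To make the difference between the two families of external balancing techniques concrete, the following is a minimal sketch in plain Python, not the authors' code. It assumes a toy two-feature binary dataset with a 90:10 class ratio (invented here for illustration). The helper names `random_undersample` and `smote_like_oversample` are hypothetical, and the second function is a simplified SMOTE-style interpolation rather than the full SMOTE or ADASYN algorithm.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in sorted(counts):
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep = rng.sample(idx, target)  # sample without replacement
        X_out.extend(X[i] for i in keep)
        y_out.extend([label] * target)
    return X_out, y_out


def smote_like_oversample(X, y, k=3, seed=0):
    """Simplified SMOTE-style oversampling for a binary problem (illustrative only).

    Each synthetic row lies on the line segment between a random minority row
    and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    minority_rows = [row for row, lab in zip(X, y) if lab == minority]
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    needed = max(counts.values()) - len(minority_rows)
    X_out, y_out = list(X), list(y)
    for _ in range(needed):
        base = rng.choice(minority_rows)
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sq_dist(base, r),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
        y_out.append(minority)
    return X_out, y_out


# Toy imbalanced dataset (assumption): 90 majority rows, 10 minority rows.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(90)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

X_under, y_under = random_undersample(X, y)
X_over, y_over = smote_like_oversample(X, y)

print("original:    ", dict(Counter(y)))
print("undersampled:", dict(Counter(y_under)))
print("oversampled: ", dict(Counter(y_over)))
```

On this toy data, undersampling keeps only 10 rows per class and discards 80 of the 90 majority rows, which illustrates the loss of data points the abstract attributes to undersampling. Oversampling keeps every original row and adds 80 synthetic minority rows. In practice, a maintained library implementation of SMOTE, ADASYN, Random Undersampling, and Near Miss would be preferable to this sketch.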
CITATION STYLE
Goswami, T., & Roy, U. B. (2021). Classification Accuracy Comparison for Imbalanced Datasets with Its Balanced Counterparts Obtained by Different Sampling Techniques. In Lecture Notes in Electrical Engineering (Vol. 698, pp. 45–54). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-15-7961-5_5