Global Data Distribution Weighted Synthetic Oversampling Technique for Imbalanced Learning

Abstract

Imbalanced learning is a common problem in data mining. In imbalanced datasets, samples are distributed unevenly among the classes, which poses a challenge for standard algorithms designed for balanced class distributions. Although there are various strategies for solving this problem, generating artificial data to achieve a relatively balanced class distribution is more universal than directly modifying specific classification algorithms, because the oversampled data can be combined with any user-specified algorithm without restriction. In this paper, we present a novel oversampling method, the Global Data Distribution Weighted Synthetic Oversampling Technique (GDDSYN). By applying clustering, GDDSYN optimizes the selection criteria for the minority class samples used to generate synthetic samples and avoids generating additional noise samples. GDDSYN assigns weights that determine the number of synthetic samples, according to the informative level of each sample and the sparsity of the cluster to which it belongs, so as to tackle within-class and between-class imbalance simultaneously. Scoring with the Silhouette Coefficient and Mutual Information helps the k-means algorithm set a reasonable number of clusters for the minority and majority classes respectively, so that the quality of the clustering is ensured. Next, using the clustering information, the generation path of synthetic samples is improved to avoid class overlap. GDDSYN has been evaluated extensively on 10 artificial and 10 real-world data sets. The empirical results show that, when artificial data generated by GDDSYN are used, our method outperforms or is comparable to some other existing methods in terms of the assessment metrics.
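The general idea of cluster-weighted oversampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the minority samples already carry cluster labels (for example from k-means), it uses a hypothetical sparsity score (mean pairwise distance divided by cluster size), and it leaves out the informative-level weighting and the Silhouette Coefficient and Mutual Information scoring. The function names `allocate_by_sparsity` and `oversample` are made up for illustration.

```python
import math
import random
from collections import defaultdict


def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs; 0.0 for fewer than 2 points."""
    n = len(points)
    if n < 2:
        return 0.0
    total = sum(math.dist(points[i], points[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def allocate_by_sparsity(clusters, n_synthetic):
    """Split n_synthetic across clusters in proportion to a sparsity score.

    Assumption: sparsity = mean pairwise distance / cluster size, so small,
    spread-out clusters receive more synthetic samples. The weighting used
    in the paper is not reproduced here.
    """
    counts = {cid: 0 for cid in clusters}
    # Clusters with a single point cannot be interpolated, so they are skipped.
    scores = {cid: mean_pairwise_distance(pts) / len(pts)
              for cid, pts in clusters.items() if len(pts) >= 2}
    total = sum(scores.values())
    if total == 0:
        return counts
    for cid, score in scores.items():
        counts[cid] = int(n_synthetic * score / total)
    # Give the remainder left by truncation to the sparsest clusters first.
    leftover = n_synthetic - sum(counts.values())
    for cid in sorted(scores, key=scores.get, reverse=True)[:leftover]:
        counts[cid] += 1
    return counts


def oversample(minority, labels, n_synthetic, seed=0):
    """Create synthetic points by interpolating between two same-cluster points.

    Restricting each interpolation path to one cluster is a simple stand-in
    for the "generation path" idea: new points stay inside a cluster instead
    of bridging the gap between distant minority regions.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for point, cid in zip(minority, labels):
        clusters[cid].append(point)
    synthetic = []
    for cid, count in allocate_by_sparsity(clusters, n_synthetic).items():
        for _ in range(count):
            a, b = rng.sample(clusters[cid], 2)
            t = rng.random()  # position along the segment from a to b
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical toy data: a dense minority cluster (0) and a sparse one (1).
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (5.0, 5.0), (7.0, 5.0), (5.0, 7.0)]
labels = [0, 0, 0, 0, 1, 1, 1]
new_points = oversample(minority, labels, n_synthetic=10)
```

In this toy setup the sparse cluster has a much higher score than the dense one, so it receives the larger share of the synthetic samples. A fuller version would also have to choose the number of clusters and account for nearby majority class samples, as the abstract describes.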

Cite

Wang, Z., & Wang, H. (2021). Global Data Distribution Weighted Synthetic Oversampling Technique for Imbalanced Learning. IEEE Access, 9, 44770–44783. https://doi.org/10.1109/ACCESS.2021.3067060
