Investigating class rarity in big data

Abstract

In Machine Learning, when one class has a significantly larger number of instances (majority) than the other (minority), the condition is defined as class imbalance. Class imbalance can bias the predictive capabilities of Machine Learning algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this imbalance may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Geometric Mean) to investigate class rarity in big data. Class rarity, a notably extreme degree of class imbalance, was induced in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets with gradually decreasing numbers of positive class instances. All model evaluations were performed through cross-validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the Area Under the Receiver Operating Characteristic Curve metric as the rarity level decreases, while corresponding scores with the Area Under the Precision-Recall Curve and Geometric Mean metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service attack data (POSTSlowloris Combined), the Area Under the Receiver Operating Characteristic Curve metric produces very high performance scores for the learners across all subsets of positive class instances. For the second study, scores for the learners generally improve with the Area Under the Precision-Recall Curve and Geometric Mean metrics as the rarity level decreases. Overall, in both case studies, the Gradient-Boosted Trees (GBT) learner performs best.
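
As a rough illustration of the experimental setup described in the abstract, the sketch below (not the authors' code) induces class rarity by randomly removing minority instances and scores a gradient-boosted trees learner with the three metrics via cross-validation. It uses scikit-learn on synthetic data; the dataset, positive-class counts, and the 0.5 decision threshold are illustrative assumptions, and average_precision_score is used as a stand-in for the Area Under the Precision-Recall Curve.

```python
# Minimal sketch, assuming synthetic data and scikit-learn, of inducing class
# rarity by randomly removing minority (positive) instances and evaluating a
# learner with AUC-ROC, AUC-PR, and Geometric Mean via cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def make_rarity_subset(X, y, n_positive, seed=0):
    """Keep every negative instance and a random sample of n_positive positives."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    keep_pos = rng.choice(pos_idx, size=min(n_positive, pos_idx.size), replace=False)
    keep = np.concatenate([neg_idx, keep_pos])
    return X[keep], y[keep]

def geometric_mean(y_true, y_pred):
    """G-Mean = sqrt(true positive rate * true negative rate)."""
    tpr = recall_score(y_true, y_pred, pos_label=1)
    tnr = recall_score(y_true, y_pred, pos_label=0)
    return np.sqrt(tpr * tnr)

# Synthetic imbalanced data stands in for the Medicare Part B / DDoS datasets.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)

for n_pos in (200, 100, 50, 25):  # gradually decreasing positive-class counts
    X_sub, y_sub = make_rarity_subset(X, y, n_pos)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    clf = GradientBoostingClassifier(random_state=42)
    # Out-of-fold probability estimates from cross-validation.
    proba = cross_val_predict(clf, X_sub, y_sub, cv=cv, method="predict_proba")[:, 1]
    preds = (proba >= 0.5).astype(int)
    print(f"positives={n_pos:4d}  "
          f"AUC-ROC={roc_auc_score(y_sub, proba):.3f}  "
          f"AUC-PR={average_precision_score(y_sub, proba):.3f}  "
          f"G-Mean={geometric_mean(y_sub, preds):.3f}")
```

At extreme rarity levels, a fixed 0.5 threshold can leave the classifier predicting no positives at all, which drives the Geometric Mean toward zero even when AUC-ROC remains high; this mirrors the metric behavior contrasted in the abstract.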

Citation (APA)

Hasanin, T., Khoshgoftaar, T. M., Leevy, J. L., & Bauder, R. A. (2020). Investigating class rarity in big data. Journal of Big Data, 7(1). https://doi.org/10.1186/s40537-020-00301-0
