Thresholding strategies for deep learning with highly imbalanced big data


Abstract

A variety of data-level, algorithm-level, and hybrid methods have been used to address the challenges associated with training predictive models on class-imbalanced data. While many of these techniques have been extended to deep neural network (DNN) models, relatively few studies emphasize the significance of output thresholding. In this chapter, we relate DNN outputs to Bayesian a posteriori probabilities and suggest that the Default threshold of 0.5 is almost never optimal when training data is imbalanced. We simulate a wide range of class imbalance levels using three real-world data sets (positive class sizes of 0.03–90%) and compare Default threshold results to two alternative thresholding strategies. The Optimal threshold strategy uses validation data or training data to search for the classification threshold that maximizes the geometric mean. The Prior threshold strategy requires no optimization and instead sets the classification threshold to the prior probability of the positive class. Multiple deep architectures are explored, and all experiments are repeated 30 times to account for random error. Linear models and visualizations show that the Optimal threshold is strongly correlated with the positive class prior. Confidence intervals show that the Default threshold only performs well when training data is balanced, while Optimal thresholds perform significantly better when training data is skewed. Surprisingly, statistical results show that the Prior threshold performs consistently as well as the Optimal threshold across all distributions. The contributions of this chapter are twofold: (1) illustrating the side effects of training deep models with highly imbalanced big data, and (2) comparing multiple thresholding strategies for maximizing class-wise performance with imbalanced training data.
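The two alternative strategies described above are simple to implement. The following is a minimal sketch (not the authors' code; function and variable names are illustrative): the Optimal threshold searches a candidate grid for the cutoff maximizing the geometric mean of the class-wise true rates, while the Prior threshold is just the empirical positive class prior.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of true positive rate and true negative rate."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return np.sqrt(tpr * tnr)

def optimal_threshold(y_true, scores, grid=np.linspace(0.01, 0.99, 99)):
    """Optimal strategy: search candidate thresholds on held-out
    (validation or training) scores; keep the G-mean maximizer."""
    return max(grid, key=lambda t: g_mean(y_true, (scores >= t).astype(int)))

def prior_threshold(y_train):
    """Prior strategy: threshold equals the positive class prior,
    so no search is needed."""
    return float(np.mean(y_train))
```

On a 10% positive training set, for example, `prior_threshold` returns 0.1, so any example whose predicted posterior exceeds the prior is labeled positive; the Default threshold of 0.5 would require the network to be far more confident before predicting the minority class.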

Citation (APA)

Johnson, J. M., & Khoshgoftaar, T. M. (2021). Thresholding strategies for deep learning with highly imbalanced big data. In Advances in Intelligent Systems and Computing (Vol. 1232, pp. 199–227). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-15-6759-9_9
