A variety of data-level, algorithm-level, and hybrid methods have been used to address the challenges of training predictive models with class-imbalanced data. While many of these techniques have been extended to deep neural network (DNN) models, relatively few studies emphasize the significance of output thresholding. In this chapter, we relate DNN outputs to Bayesian a posteriori probabilities and suggest that the Default threshold of 0.5 is almost never optimal when training data is imbalanced. Using three real-world data sets, we simulate a wide range of class imbalance levels, with positive class sizes of 0.03–90%, and compare Default threshold results to two alternative thresholding strategies. The Optimal threshold strategy uses validation data or training data to search for the classification threshold that maximizes the geometric mean. The Prior threshold strategy requires no optimization and instead sets the classification threshold to the prior probability of the positive class. Multiple deep architectures are explored, and all experiments are repeated 30 times to account for random error. Linear models and visualizations show that the Optimal threshold is strongly correlated with the positive class prior. Confidence intervals show that the Default threshold performs well only when training data is balanced and that Optimal thresholds perform significantly better when training data is skewed. Surprisingly, statistical results show that the Prior threshold consistently performs as well as the Optimal threshold across all distributions. The contributions of this chapter are twofold: (1) illustrating the side effects of training deep models with highly imbalanced big data and (2) comparing multiple thresholding strategies for maximizing class-wise performance with imbalanced training data.
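The three thresholding strategies described above can be illustrated with a minimal Python sketch. This is not the chapter's implementation: the function names (`g_mean`, `optimal_threshold`, `prior_threshold`), the candidate grid of 1001 evenly spaced thresholds, and the strict `>` comparison are all assumptions made for illustration. The geometric mean is assumed to be the usual square root of the product of the true positive rate and the true negative rate.

```python
import numpy as np


def g_mean(y_true, y_prob, threshold):
    """Geometric mean of TPR and TNR at a given threshold (assumed definition)."""
    y_true = np.asarray(y_true).astype(bool)
    # Assumption: an example is labeled positive when its score exceeds the threshold.
    y_pred = np.asarray(y_prob) > threshold
    tpr = np.mean(y_pred[y_true]) if y_true.any() else 0.0
    tnr = np.mean(~y_pred[~y_true]) if (~y_true).any() else 0.0
    return float(np.sqrt(tpr * tnr))


def optimal_threshold(y_true, y_prob, candidates=None):
    """Grid search for the threshold that maximizes the geometric mean.

    The grid is a hypothetical choice; ties resolve to the smallest candidate.
    """
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 1001)
    scores = [g_mean(y_true, y_prob, t) for t in candidates]
    return float(candidates[int(np.argmax(scores))])


def prior_threshold(y_train):
    """Prior strategy: the threshold is the fraction of positive training labels."""
    return float(np.mean(y_train))


if __name__ == "__main__":
    # Toy, hand-made labels and scores (not data from the chapter).
    y = [0, 0, 0, 0, 1]
    p = [0.05, 0.10, 0.15, 0.30, 0.25]
    for name, t in [
        ("Default", 0.5),
        ("Optimal", optimal_threshold(y, p)),
        ("Prior", prior_threshold(y)),
    ]:
        print(f"{name}: threshold={t:.3f}, g-mean={g_mean(y, p, t):.3f}")
```

In practice the Optimal threshold would be searched on validation or training data and then applied to held-out test data, while the Prior threshold needs only the training labels.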
Johnson, J. M., & Khoshgoftaar, T. M. (2021). Thresholding strategies for deep learning with highly imbalanced big data. In Advances in Intelligent Systems and Computing (Vol. 1232, pp. 199–227). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-15-6759-9_9