Improving software-quality predictions with data sampling and boosting

Chris Seiffert; Taghi M. Khoshgoftaar; Jason Van Hulse

Journal Article

IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans (2009) 39(6) 1283-1294

DOI: 10.1109/TSMCA.2009.2027131

94 Citations

142 Readers

Abstract

Software-quality data sets tend to fall victim to the class-imbalance problem that plagues so many other application domains. The majority of faults in a software system, particularly in high-assurance systems, usually lie in a very small percentage of the software modules. This imbalance between the number of fault-prone (fp) and non-fp (nfp) modules can have a severely negative impact on a data-mining technique's ability to differentiate between the two. This paper addresses the class-imbalance problem as it pertains to the domain of software-quality prediction. We present a comprehensive empirical study examining two different methodologies, data sampling and boosting, for improving the performance of decision-tree models designed to identify fp software modules. This paper applies five data-sampling techniques and boosting to 15 software-quality data sets of different sizes and levels of imbalance. Nearly 50 000 models were built for the experiments contained in this paper. Our results show that while data-sampling techniques are very effective in improving the performance of such models, boosting almost always outperforms even the best data-sampling techniques. This significant result, which, to our knowledge, has not been previously reported, has important consequences for practitioners developing software-quality classification models. © 2009 IEEE.
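The two methodologies compared in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not name the five sampling techniques or the boosting variant, so the code assumes random undersampling as one representative sampling technique and AdaBoost as a representative boosting algorithm. One-level decision trees (stumps) stand in for the decision-tree models of the study. All function names and the toy data are hypothetical; labels use +1 for fp and -1 for nfp.

```python
import math
import random


def random_undersample(X, y, minority=1, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumption: random undersampling stands in for the (unnamed) sampling
    techniques of the study.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    n_keep = min(len(minority_idx), len(majority_idx))
    kept = sorted(minority_idx + rng.sample(majority_idx, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


def stump_predict(x, feature, threshold, polarity):
    """One-level decision tree: predict `polarity` above the threshold."""
    return polarity if x[feature] > threshold else -polarity


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                err = sum(
                    wi
                    for xi, yi, wi in zip(X, y, w)
                    if stump_predict(xi, feature, threshold, polarity) != yi
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost(X, y, rounds=10):
    """Minimal AdaBoost over stumps (assumed variant; labels must be +1/-1)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, polarity = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, feature, threshold, polarity))
        # Raise the weight of misclassified modules, lower the rest.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(xi, feature, threshold, polarity))
            for xi, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(x, f, t, p) for a, f, t, p in ensemble)
    return 1 if score > 0 else -1


# Toy imbalanced data: one hypothetical metric per module, 2 fp among 10.
X = [[float(v)] for v in range(10)]
y = [1 if v >= 8 else -1 for v in range(10)]

X_balanced, y_balanced = random_undersample(X, y)  # sampling route
model = adaboost(X, y, rounds=5)                   # boosting route
```

Sampling changes the training data before a single model is built, whereas boosting keeps all modules and reweights them between rounds so that later trees concentrate on the modules earlier trees misclassified.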

Cite

Seiffert, C., Khoshgoftaar, T. M., & Van Hulse, J. (2009). Improving software-quality predictions with data sampling and boosting. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 39(6), 1283–1294. https://doi.org/10.1109/TSMCA.2009.2027131
