Improving software-quality predictions with data sampling and boosting

  • Chris Seiffert
  • Taghi M. Khoshgoftaar
  • Jason Van Hulse

Software-quality data sets tend to fall victim to the class-imbalance problem that plagues so many other application domains. The majority of faults in a software system, particularly high-assurance systems, usually lie in a very small percentage of the software modules. This imbalance between the number of fault-prone (fp) and non-fp (nfp) modules can have a severely negative impact on a data-mining technique's ability to differentiate between the two. This paper addresses the class-imbalance problem as it pertains to the domain of software-quality prediction. We present a comprehensive empirical study examining two different methodologies, data sampling and boosting, for improving the performance of decision-tree models designed to identify fp software modules. This paper applies five data-sampling techniques and boosting to 15 software-quality data sets of different sizes and levels of imbalance. Nearly 50 000 models were built for the experiments contained in this paper. Our results show that while data-sampling techniques are very effective in improving the performance of such models, boosting almost always outperforms even the best data-sampling techniques. This significant result, which, to our knowledge, has not been previously reported, has important consequences for practitioners developing software-quality classification models.
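The two methodologies compared in the abstract can be illustrated with a minimal sketch. The abstract does not name the five sampling techniques or the boosting algorithm, so the code below makes its own assumptions: random undersampling stands in for data sampling, and an AdaBoost-style ensemble of one-feature decision stumps stands in for boosted decision trees. The toy data set (two hypothetical module metrics, lines of code and complexity, with 18 nfp and 2 fp modules) and all function names are invented for illustration and do not come from the paper.

```python
import math
import random


def random_undersample(X, y, seed=0):
    """Drop majority-class rows at random until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best one-feature rule."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] >= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best


def stump_predict(stump, row):
    """Apply a fitted stump to one row; returns +1 (fp) or -1 (nfp)."""
    _, f, t, sign = stump
    return sign if row[f] >= t else -sign


def adaboost(X, y01, rounds=10):
    """Train an AdaBoost-style ensemble of stumps on 0/1 labels."""
    y = [1 if label == 1 else -1 for label in y01]
    n = len(X)
    w = [1.0 / n] * n  # start with uniform instance weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        # Clamp the error so the log below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified modules, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps; returns 1 for fp, 0 for nfp."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else 0


# Hypothetical imbalanced data: [lines_of_code, complexity]; 18 nfp, 2 fp.
X = [[50 + 10 * i, 3 + i % 5] for i in range(18)] + [[400, 25], [380, 30]]
y = [0] * 18 + [1] * 2

# Approach 1: rebalance the training data before fitting a model.
Xb, yb = random_undersample(X, y)

# Approach 2: keep all data and let boosting reweight hard instances.
model = adaboost(X, y, rounds=5)
```

Undersampling discards most of the nfp modules to balance the classes, whereas boosting keeps every module and shifts weight toward the ones the current ensemble misclassifies. That difference is one plausible reading of why the two approaches can behave differently on imbalanced data; this sketch does not reproduce the paper's experiments or results.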

Author-supplied keywords

  • Binary classification
  • Boosting
  • Class imbalance
  • Classification
  • Sampling
  • Software quality
