Improving software-quality predictions with data sampling and boosting

  • Seiffert C
  • Khoshgoftaar T
  • Van Hulse J

  • 62 readers (Mendeley users who have this article in their library)
  • 46 citations of this article

Abstract

Software-quality data sets tend to fall victim to the class-imbalance problem that plagues so many other application domains. The majority of faults in a software system, particularly high-assurance systems, usually lie in a very small percentage of the software modules. This imbalance between the number of fault-prone (fp) and non-fp (nfp) modules can have a severely negative impact on a data-mining technique's ability to differentiate between the two. This paper addresses the class-imbalance problem as it pertains to the domain of software-quality prediction. We present a comprehensive empirical study examining two different methodologies, data sampling and boosting, for improving the performance of decision-tree models designed to identify fp software modules. This paper applies five data-sampling techniques and boosting to 15 software-quality data sets of different sizes and levels of imbalance. Nearly 50 000 models were built for the experiments contained in this paper. Our results show that while data-sampling techniques are very effective in improving the performance of such models, boosting almost always outperforms even the best data-sampling techniques. This significant result, which, to our knowledge, has not been previously reported, has important consequences for practitioners developing software-quality classification models.
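
The two methodologies compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the five sampling techniques or the boosting algorithm, so random undersampling and discrete AdaBoost are assumptions chosen for illustration, and one-level decision stumps stand in for the paper's decision-tree models. The function names and the toy data are hypothetical.

```python
import math
import random


def random_undersample(rows, labels, minority, rng):
    """Data sampling: drop majority rows at random until the classes balance."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    keep = minority_idx + rng.sample(majority_idx, len(minority_idx))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def stump_predict(row, feature, threshold, sign):
    """One-level decision tree: predict `sign` at or above the threshold."""
    return sign if row[feature] >= threshold else -sign


def fit_stump(rows, labels, weights):
    """Return (error, feature, threshold, sign) with the lowest weighted error."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for sign in (1, -1):
                err = sum(w for r, y, w in zip(rows, labels, weights)
                          if stump_predict(r, f, t, sign) != y)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best


def adaboost(rows, labels, rounds):
    """Boosting (assumed variant: discrete AdaBoost); labels are +1 (fp) or -1 (nfp)."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, f, t, sign = fit_stump(rows, labels, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, sign))
        # Misclassified modules gain weight, so the next stump focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(r, f, t, sign))
                   for r, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(row, f, t, s) for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: one metric per module, 2 fp modules (+1) among 10.
rows = [[1], [2], [3], [4], [5], [6], [7], [8], [20], [25]]
labels = [-1] * 8 + [1] * 2

balanced_rows, balanced_labels = random_undersample(rows, labels, 1, random.Random(0))
model = adaboost(rows, labels, rounds=5)
```

Sampling changes the training set before any model is built, whereas boosting keeps every module and reweights the ones that earlier models misclassified. Toy data of this kind says nothing about which approach performs better; that comparison is the subject of the paper's experiments.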

Author-supplied keywords

  • Binary classification
  • Boosting
  • Class imbalance
  • Classification
  • Sampling
  • Software quality

Authors

  • Chris Seiffert

  • Taghi M. Khoshgoftaar

  • Jason Van Hulse
