The impact of overfitting and overgeneralization on the classification accuracy in data mining

Huy Nguyen Anh Pham; Evangelos Triantaphyllou

Book Chapter

Springer US (2008), pp. 391-431

DOI: 10.1007/978-0-387-69935-6_16

Abstract

Many classification studies conclude with a summary table that presents the performance of various data mining approaches on different datasets. No single method outperforms all others all the time. Furthermore, the performance of a classification method in terms of its false-positive and false-negative rates may be totally unpredictable. Attempts to minimize either of these two rates may lead to an increase in the other. If the model allows new data to be deemed unclassifiable when there is not adequate information to classify them, then it is possible for these two error rates to be very low while, at the same time, the rate of unclassifiable new examples is very high. The root of this critical problem is the overfitting and overgeneralization behavior of a given classification approach when it processes a particular dataset. Although this situation is of fundamental importance to data mining, it has not been studied from a comprehensive point of view. Thus, this chapter analyzes these issues in depth. It also proposes a new approach called the Homogeneity-Based Algorithm (or HBA) for optimally controlling the three error rates. This is done by first formulating an optimization problem. The key development in this chapter is based on a special way of analyzing the space of the training data and then partitioning it according to the data density of different regions of this space. Next, the classification task is pursued based on this partitioning of the training space. In this way, the three error rates can be controlled in a comprehensive manner. Some preliminary computational results seem to indicate that the proposed approach has significant potential to fill a critical gap in current data mining methodologies. © 2008 Springer-Verlag US.
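
The trade-off among the three error rates described in the abstract can be illustrated with a small sketch. This is not the HBA itself, and the abstract does not state the chapter's optimization problem; the weighted-sum cost, the function names `error_rates` and `total_cost`, the penalty weights, and the choice to express each rate as a fraction of all test examples are all assumptions made here for illustration only.

```python
from typing import Dict, Optional, Sequence


def error_rates(y_true: Sequence[int],
                y_pred: Sequence[Optional[int]]) -> Dict[str, float]:
    """Compute false-positive, false-negative, and unclassifiable rates.

    Labels are 1 (positive) or 0 (negative). A prediction of None means the
    model declined to classify the example. Each rate is a fraction of ALL
    examples (an assumption of this sketch, not taken from the chapter).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    if n == 0:
        raise ValueError("at least one example is required")

    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    uc = sum(1 for p in y_pred if p is None)
    return {"fp": fp / n, "fn": fn / n, "uc": uc / n}


def total_cost(rates: Dict[str, float],
               c_fp: float = 1.0,
               c_fn: float = 1.0,
               c_uc: float = 1.0) -> float:
    """Hypothetical weighted sum of the three rates (illustrative objective)."""
    return c_fp * rates["fp"] + c_fn * rates["fn"] + c_uc * rates["uc"]


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    # A cautious model: never wrong, but refuses to label half the examples.
    cautious = [1, None, 0, None, 1, None, None, 0]
    # A decisive model: labels everything, with one error of each kind.
    decisive = [1, 0, 0, 1, 1, 0, 1, 0]

    for name, pred in (("cautious", cautious), ("decisive", decisive)):
        rates = error_rates(y_true, pred)
        # Hypothetical penalties: a false negative is assumed to cost far more.
        print(name, rates, total_cost(rates, c_fp=1.0, c_fn=20.0, c_uc=3.0))
```

Under these made-up penalties the cautious model has zero false-positive and false-negative rates but a high unclassifiable rate, which is the situation the abstract warns about; which model is preferable depends entirely on the penalty weights chosen for the application.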

Cite

Pham, H. N. A., & Triantaphyllou, E. (2008). The impact of overfitting and overgeneralization on the classification accuracy in data mining. In Soft Computing for Knowledge Discovery and Data Mining (pp. 391–431). Springer US. https://doi.org/10.1007/978-0-387-69935-6_16
