Instance Ranking Using Data Complexity Measures for Training Set Selection


Abstract

A classifier’s performance depends on the training set it is given, so training set selection holds an important place in the classification task. Careful selection can both improve classifier performance and reduce training time, and it can be approached through various means such as selection algorithms, data-handling techniques, cost-sensitive methods and ensembles. In this work, one of the data complexity measures, Maximum Fisher’s discriminant ratio (F1), is used to identify good training instances. For a given feature, this measure discriminates between two classes by comparing the class means and variances; in particular, it quantifies the overlap between the classes. In the first phase, F1 of the whole data set is calculated. Then F1 is computed with the leave-one-out method to rank each instance. Finally, the instances that lower the F1 value are removed from the data set as a batch. A small F1 value indicates strong overlap between the classes, so removing the instances that cause the most overlap reduces the overlap further. The efficacy of the proposed reduction algorithm (DRF1) is demonstrated empirically using 4 classifiers (Random Forest, Decision Tree-C5.0, SVM and kNN) and 6 data sets (Pima, Musk, Sonar, Winequality (R and W) and Wisconsin). The results confirm that DRF1 leads to a promising improvement in kappa statistics and classification accuracy under training set selection with this data complexity measure. A reduction of approximately 18–50% in the training set is achieved, along with a substantial reduction in training time.

Citation (APA)

Alam, J., & Sobha Rani, T. (2019). Instance Ranking Using Data Complexity Measures for Training Set Selection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11941 LNCS, pp. 179–188). Springer. https://doi.org/10.1007/978-3-030-34869-4_20
