Abstract— Data classification is one of the most important tasks in data mining: it identifies the category to which a new observation belongs on the basis of a training set. Preparing data before any mining is performed is an essential step to ensure the quality of the mined data. Different algorithms are used to solve classification problems. In this research, four algorithms, namely support vector machine (SVM), C5.0, K-nearest neighbor (KNN), and Recursive Partitioning and Regression Trees (rpart), are compared before and after applying two feature selection (FS) techniques: the wrapper and the filter. The comparative study is implemented throughout using the R programming language. A direct marketing campaigns dataset from a banking institution is used to predict whether a client will subscribe to a term deposit or not. The dataset is composed of 4521 instances: 3521 instances (78%) form the training set and 1000 instances (22%) form the testing set. The results show that C5.0 is superior to the other algorithms before applying FS and SVM is superior after applying FS.

Keywords— Classification, Feature Selection, Wrapper Technique, Filter Technique, Support Vector Machine (SVM), C5.0, K-Nearest Neighbor (KNN), Recursive Partitioning and Regression Trees (rpart).

I. INTRODUCTION

The problem of data classification has numerous applications in a wide variety of mining tasks, because it attempts to learn the relationship between a set of feature variables and a target variable of interest. Excellent overviews of data classification may be found in the literature. Classification algorithms typically comprise two phases: a training phase, in which a model is constructed from the training instances, and a testing phase, in which the model is used to assign a label to an unlabeled test instance [1].

Classification consists of predicting a certain outcome based on a given input. To predict the outcome, the algorithm processes a training set containing a set of attributes and the respective outcome, usually called the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the algorithm is given a data set, called the prediction set, which contains the same set of attributes except for the prediction attribute, whose value is not yet known. The algorithm analyzes the input and produces predicted instances. The prediction accuracy defines how "good" the algorithm is [2]. The four classifiers used in this paper are shown in Figure 1, and a minimal R sketch of this two-phase workflow is given at the end of this section.

However, many irrelevant, noisy, or ambiguous attributes may be present in the data to be mined. They need to be removed because they degrade the performance of the algorithms. Attribute selection methods are used to avoid overfitting, to improve model performance, and to provide faster and more cost-effective models [3]. The main purpose of the Feature Selection (FS) approach is to select a minimal and relevant feature subset for a given dataset while maintaining its original representation. FS not only reduces the dimensionality of the data but also enhances the performance of a classifier. The task of FS is therefore to search for the best possible feature subset for the problem to be solved [4]; the second sketch at the end of this section contrasts the two FS families.

This paper is organized as follows. Section 2 describes the four algorithms used to deal with the classification problem. Section 3 describes the FS techniques used. Section 4 demonstrates our experimental methodology, and Section 5 presents the results. Finally, Section 6 provides the conclusion and future work.
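As a concrete illustration of the two-phase workflow and the experimental setup described above, the following minimal R sketch loads the bank marketing dataset, reproduces the 3521/1000 split, and trains the four classifiers. It is a sketch under stated assumptions: the file name bank.csv, the semicolon separator, the target column y, the random seed, and k = 5 for KNN are illustrative choices, and the CRAN packages e1071, C50, class, and rpart are standard implementations of these algorithms rather than the exact configuration of the original experiments.

```r
# Minimal sketch of the train/test workflow; file name, separator, seed,
# and k are assumptions, not the study's exact configuration.
library(e1071)   # svm()
library(C50)     # C5.0()
library(class)   # knn()
library(rpart)   # rpart()

bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)

set.seed(42)                          # assumed seed, for reproducibility
idx   <- sample(nrow(bank), 3521)     # 3521 training instances (~78%)
train <- bank[idx, ]
test  <- bank[-idx, ]                 # remaining 1000 instances (~22%)

# Training phase: build one model per algorithm.
m_svm   <- svm(y ~ ., data = train)
m_c50   <- C5.0(y ~ ., data = train)
m_rpart <- rpart(y ~ ., data = train, method = "class")

# Testing phase: label the unseen instances and measure accuracy,
# i.e. the fraction of test instances whose predicted label is correct.
acc <- function(pred) mean(pred == test$y)
acc(predict(m_svm, test))
acc(predict(m_c50, test))
acc(predict(m_rpart, test, type = "class"))

# KNN is lazy (no separate training step) and needs numeric inputs, so
# the categorical columns are one-hot encoded with model.matrix().
X_tr <- model.matrix(y ~ . - 1, data = train)
X_te <- model.matrix(y ~ . - 1, data = test)
acc(knn(X_tr, X_te, cl = train$y, k = 5))   # k = 5 is an assumed value
```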
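The two FS families named above can likewise be sketched in R, reusing the train data frame from the previous sketch. The concrete methods of this study are described in Section 3; here, information gain as the filter criterion, the cutoff k = 8, and a greedy forward search scored by rpart holdout accuracy as the wrapper are all illustrative assumptions, and the FSelector package is one common CRAN option.

```r
# Illustrative filter vs. wrapper selection; criterion, cutoff, and search
# strategy are assumptions, not necessarily those of the original study.
library(FSelector)
library(rpart)

# Filter: rank attributes by information gain, independently of any
# classifier, and keep the k highest-scoring ones (k = 8 is assumed).
ig <- information.gain(y ~ ., data = train)
filter_subset <- cutoff.k(ig, 8)
print(filter_subset)

# Wrapper: greedy forward search in which every candidate subset is
# scored by the holdout accuracy of a classifier (rpart here), so the
# selection is tailored to the learner itself.
evaluator <- function(subset) {
  f   <- as.simple.formula(subset, "y")
  k   <- sample(nrow(train), floor(0.7 * nrow(train)))  # internal holdout
  fit <- rpart(f, data = train[k, ], method = "class")
  mean(predict(fit, train[-k, ], type = "class") == train[-k, "y"])
}
wrapper_subset <- forward.search(setdiff(names(train), "y"), evaluator)
print(as.simple.formula(wrapper_subset, "y"))
```

The essential contrast is that the filter score never consults a classifier, whereas the wrapper retrains one for every candidate subset, which is usually more accurate for the chosen learner but far more expensive.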