Abstract
Supervised and unsupervised learning have been the focus of critical research in the areas of machine learning and artificial intelligence. In the literature, these two streams flow independently of each other, despite their close conceptual and practical connections. In this work we deal exclusively with the scenario of text classification aided by clustering. This chapter provides a review and interpretation of the role of clustering in different fields of text classification, with an eye towards identifying the important areas of research. Drawing upon the literature review and analysis, we discuss several important research issues surrounding text classification tasks and the role of clustering in support of these tasks. We define the problem, postulate a number of baseline methods, examine the techniques used, and classify them into meaningful categories.

A standard research issue for text classification is the creation of compact representations of the feature space and the discovery of the complex relationships that exist between features, documents and classes. There are several approaches that try to quantify the notion of information for the basic components of a text classification problem. Given the variables of interest, sources of information about these variables can be compressed while preserving their information. Clustering is one of the approaches used in this context. In this vein, an important area of research where clustering is used to aid text classification is dimensionality reduction. Clustering is used as a feature compression and/or extraction method: features are clustered into groups based on selected clustering criteria. Feature clustering methods create new, reduced-size event spaces by joining similar features into groups. They define a similarity measure between features, and collapse similar features into single events that no longer distinguish among their constituent features.
Typically, the parameters of the cluster become the weighted average of the parameters of its constituent features. Two types of clustering have been studied: i) one-way clustering, i.e. feature clustering based on the distributions of features in the documents or classes, and ii) co-clustering, i.e. clustering both features and documents.

A second research area of text classification where clustering has a lot to offer is semi-supervised learning, in which the training data contain both labelled and unlabelled examples. Obtaining a fully labelled training set is a difficult task; labelling is usually done using human expertise, which is expensive, time consuming, and error prone. Obtaining unlabelled data is much easier since it involves collecting data that are known to belong to one of the classes without having to label them. Clustering is used as a method to extract information from the unlabelled data in order to boost the classification task. In particular, clustering is used: i) to create a training set from the unlabelled data, ii) to augment the training set with new documents from the unlabelled data, iii) to augment the dataset with new features, and iv) to co-train a classifier.

www.intechopen.com
234 Tools in Artificial Intelligence

Finally, clustering in large-scale classification problems is another major research area in text classification. A considerable amount of work has been done on using clustering to reduce the training time of a classifier when dealing with large data sets. In particular, while SVM classifiers (see Burges, 1998, for a tutorial) have proved to be a great success in many areas, their training time is at least O(N^2) for training data of size N, which makes them unfavourable for large data sets. The same problem applies to other classifiers as well.
In this vein, clustering is used as a down-sampling pre-process to classification, in order to reduce the size of the training set, resulting in reduced dimensionality and a smaller, less complex classification problem that is easier and quicker to solve. However, it should be noted that dimensionality reduction is not accomplished directly, using clustering as a feature reduction technique as discussed earlier, but rather indirectly, through the removal of training examples that are most probably not useful to the classification task and the selection of the most representative redundant training set. In most cases this involves the collaboration of both clustering and classification techniques.

The chapter is organized as follows: the next section presents a review of the literature on text classification aided by clustering. It provides a comprehensive summary of the alternative views and applications of clustering discussed above and their implications for text classification. A broader perspective on clustering and text classification research is then provided by discussing important research themes that emerge from the review of the literature and by classifying them into meaningful concept groups. We conclude by pointing out open issues and limitations of the techniques presented.
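
To make the feature clustering idea summarized in this abstract concrete, the following is a minimal sketch in Python. It is not a method taken from the chapter. It assumes that each feature is described by a distribution over classes and a corpus count, that similarity is measured by L1 distance, and that clusters are formed greedily in one pass. The toy vocabulary, the counts, the threshold of 0.3, and the function names `cluster_features` and `project` are all hypothetical. What it illustrates is the collapse of similar features into a single event whose parameters are the count-weighted average of its members.

```python
from typing import Dict, List

# Hypothetical toy data: P(class | feature) over (sports, politics),
# plus how often each feature occurs in the corpus.
CLASS_DIST: Dict[str, List[float]] = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "vote": [0.1, 0.9],
    "election": [0.2, 0.8],
}
COUNTS: Dict[str, int] = {"goal": 30, "match": 10, "vote": 20, "election": 20}


def l1_distance(p: List[float], q: List[float]) -> float:
    """L1 distance between two distributions over the same classes."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cluster_features(class_dist: Dict[str, List[float]],
                     counts: Dict[str, int],
                     threshold: float = 0.3) -> List[dict]:
    """Greedy one-pass feature clustering (an assumed, simplified scheme).

    A feature joins the first cluster whose centroid lies within
    `threshold`; otherwise it starts a new cluster. On a merge, the
    centroid becomes the count-weighted average of the old centroid
    and the new feature's distribution.
    """
    clusters: List[dict] = []
    for feature in sorted(class_dist):  # fixed order keeps results reproducible
        dist, weight = class_dist[feature], counts[feature]
        for cluster in clusters:
            if l1_distance(cluster["centroid"], dist) <= threshold:
                total = cluster["weight"] + weight
                cluster["centroid"] = [
                    (cluster["weight"] * c + weight * d) / total
                    for c, d in zip(cluster["centroid"], dist)
                ]
                cluster["weight"] = total
                cluster["members"].append(feature)
                break
        else:
            # No cluster was close enough: open a new one.
            clusters.append(
                {"members": [feature], "weight": weight, "centroid": list(dist)}
            )
    return clusters


def project(doc_counts: Dict[str, int], clusters: List[dict]) -> List[int]:
    """Re-express a document over clusters instead of individual features."""
    index = {f: i for i, c in enumerate(clusters) for f in c["members"]}
    vector = [0] * len(clusters)
    for feature, n in doc_counts.items():
        if feature in index:  # features outside the vocabulary are ignored
            vector[index[feature]] += n
    return vector


if __name__ == "__main__":
    found = cluster_features(CLASS_DIST, COUNTS)
    for c in found:
        print(c["members"], c["centroid"])
    print(project({"goal": 2, "match": 1, "vote": 1}, found))
```

In this sketch a document that mentions "goal" twice and "match" once contributes three counts to a single cluster dimension, so the classifier no longer distinguishes between the two words. A one-pass greedy scheme is order dependent; it was chosen here only for brevity.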
Cite
Belingardi, G., Brunella, V., Martorana, B., & Ciardiello, R. (2016). Thermoplastic Adhesive for Automotive Applications. In Adhesives - Applications and Properties. InTech. https://doi.org/10.5772/65168