Generalized term similarity for feature selection in text classification using quadratic programming

Hyunki Lim; Dae Won Kim

Journal Article (Open Access)

Entropy (2020) 22(4)

DOI: 10.3390/E22040395

Abstract

The rapid growth of Internet technologies has led to an enormous increase in the number of electronic documents used worldwide. To organize and manage large collections of unstructured documents effectively and efficiently, text categorization has been employed in recent decades. For text categorization tasks, documents are usually represented using the bag-of-words model, owing to its simplicity. In this representation, feature selection becomes essential because treating every term in the vocabulary as a feature induces an enormous feature space. In this paper, we propose a new feature selection method that considers term similarity to avoid selecting redundant terms. Term similarity is measured using a general method such as mutual information, and serves as a second measure in feature selection in addition to term ranking. To balance term ranking and term similarity, we use a quadratic programming-based numerical optimization approach. Experimental results demonstrate that considering term similarity is effective and yields higher accuracy than conventional methods.
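The abstract does not state the optimization problem itself, so the following is only a minimal sketch of how a term-ranking score and a term-similarity matrix might be balanced in a quadratic program. It assumes a generic objective, minimize (1 - alpha) * w'Qw - alpha * f'w over weights w that are non-negative and sum to 1, where f holds the mutual information between each term and the class label and Q holds the pairwise mutual information between terms. The names `select_terms`, `alpha`, and `iters`, the binary term-occurrence input, and the projected-gradient solver are all illustrative assumptions, not the authors' formulation.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(a)
    joint, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in joint.items())


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k satisfying the condition fixes the threshold
    return [max(x - theta, 0.0) for x in v]


def select_terms(docs, labels, k, alpha=0.5, iters=500):
    """Hypothetical helper: rank terms by the weights of an assumed QP.

    docs   -- list of documents, each a list of 0/1 term-occurrence flags
    labels -- class label of each document
    alpha  -- assumed trade-off: 1.0 means ranking only, 0.0 similarity only
    """
    m = len(docs[0])
    cols = [[d[j] for d in docs] for j in range(m)]
    f = [mutual_information(c, labels) for c in cols]            # term ranking
    Q = [[mutual_information(cols[i], cols[j]) for j in range(m)]
         for i in range(m)]                                      # term similarity
    # Upper bound on the gradient's Lipschitz constant; gives a safe step size.
    lipschitz = 2 * (1 - alpha) * max(sum(abs(q) for q in row) for row in Q) or 1.0
    w = [1.0 / m] * m
    for _ in range(iters):
        grad = [2 * (1 - alpha) * sum(Q[i][j] * w[j] for j in range(m))
                - alpha * f[i] for i in range(m)]
        w = project_to_simplex([w[i] - grad[i] / lipschitz for i in range(m)])
    ranked = sorted(range(m), key=lambda i: -w[i])
    return ranked[:k], w


if __name__ == "__main__":
    # Toy data: six documents, three terms, binary class labels.
    docs = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 1]]
    labels = [1, 1, 1, 0, 0, 0]
    print(select_terms(docs, labels, k=2))
```

Projected gradient descent is used here only to keep the sketch dependency-free; a dedicated QP solver would be the more usual choice for a realistic vocabulary size. Under this assumed objective, raising `alpha` favors terms that score well individually, while lowering it penalizes terms that share information with other heavily weighted terms.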

Cite

Lim, H., & Kim, D. W. (2020). Generalized term similarity for feature selection in text classification using quadratic programming. Entropy, 22(4). https://doi.org/10.3390/E22040395
