Classification based topic extraction using domain-specific vocabulary: a supervised approach

Abstract

Latent Dirichlet allocation (LDA), a probabilistic topic modelling approach, has recently been applied extensively to document classification. However, classical LDA is an unsupervised algorithm that uses a fixed number of topics, incorporates no prior domain knowledge, and produces different outcomes when the order of the documents changes. This article presents a comprehensive framework that avoids both the order effect and the unsupervised probabilistic nature of LDA. First, the framework builds a category-specific vocabulary using a weight-dependent model that extracts distinctive features suitable for supervised classification. Then, it transforms each classified cluster of documents from the domain corpus into the relevant topic, which makes it more robust to noise. The framework was tested on a comprehensive collection of benchmark news datasets that vary in sample size, class characteristics, and classification tasks. Compared with conventional classification methods, the proposed framework achieved 95.56% and 95.23% accuracy on two datasets, indicating a better classification capability. Furthermore, the topics extracted from the classified clusters are highly relevant to the domain categories.
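
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the weight-dependent model, so a simple TF-IDF-style weight computed across categories is swapped in here as an assumption, and the topic step simply ranks vocabulary terms by frequency within each classified cluster. All function names, the stop-word list, and the `top_k` and `top_n` parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "in", "as", "of", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stop-word tokens."""
    return [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]


def build_category_vocab(labeled_docs, top_k=5):
    """Stage 1: build {category: {term: weight}} from (text, label) pairs.

    The weight is an assumed TF-IDF-style score: the term's relative frequency
    in the category times log((1 + n_categories) / categories_containing_term).
    """
    term_counts = defaultdict(Counter)
    for text, label in labeled_docs:
        term_counts[label].update(tokenize(text))

    n_cats = len(term_counts)
    cat_freq = Counter()  # number of categories in which each term occurs
    for counts in term_counts.values():
        cat_freq.update(counts.keys())

    vocab = {}
    for label, counts in term_counts.items():
        total = sum(counts.values())
        weights = {
            t: (c / total) * math.log((1 + n_cats) / cat_freq[t])
            for t, c in counts.items()
        }
        # Keep the top_k terms; ties are broken alphabetically for determinism.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vocab[label] = dict(top)
    return vocab


def classify(text, vocab):
    """Assign the category whose vocabulary gives the highest weighted overlap."""
    tokens = Counter(tokenize(text))
    scores = {
        label: sum(w * tokens[t] for t, w in terms.items())
        for label, terms in vocab.items()
    }
    return max(scores, key=scores.get)


def extract_topics(docs, vocab, top_n=3):
    """Stage 2: group documents by predicted category, then report the most
    frequent vocabulary terms of each cluster as its topic words."""
    clusters = defaultdict(Counter)
    for text in docs:
        label = classify(text, vocab)
        clusters[label].update(t for t in tokenize(text) if t in vocab[label])
    return {
        label: [t for t, _ in counts.most_common(top_n)]
        for label, counts in clusters.items()
    }
```

Because the vocabulary is derived from aggregate counts over the labelled corpus, the result of this sketch does not depend on the order in which the training documents are supplied, which is the property the abstract contrasts with classical LDA.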

Cite

Kalra, V., Kashyap, I., & Kaur, H. (2022). Classification based topic extraction using domain-specific vocabulary: a supervised approach. Indonesian Journal of Electrical Engineering and Computer Science, 26(1), 442–449. https://doi.org/10.11591/ijeecs.v26.i1.pp442-449
