Recently, a probabilistic topic modelling approach, latent dirichlet allocation (LDA), has been extensively applied in the arena of document classification. However, classical LDA is an unsupervised algorithm implemented using a fixed number of topics without prior domain knowledge and generates different outcomes with the change in the order of documents. This article presents a comprehensive framework to evade the order effect and unsupervised probabilistic nature. First, the framework creates the vocabulary specific to the category using a weight-dependent model that extracts distinctive features suitable for supervised classification. Then, it transforms a classified cluster of documents from the domain corpus to the relevant topic making it more robust to noise. The framework was tested on a comprehensive collection of benchmark news datasets that vary in sample size, class characteristics, and classification tasks. In contrast to the conventional classification methods, the proposed framework achieved 95.56% and 95.23% accuracy when applied on two datasets, indicating that the proposed algorithm has a better classification capability. Furthermore, the topics extracted from the classified clusters are highly relevant to domain categories.
CITATION STYLE
Kalra, V., Kashyap, I., & Kaur, H. (2022). Classification based topic extraction using domain-specific vocabulary: a supervised approach. Indonesian Journal of Electrical Engineering and Computer Science, 26(1), 442–449. https://doi.org/10.11591/ijeecs.v26.i1.pp442-449
Mendeley helps you to discover research relevant for your work.