Filtering and Classifying Relevant Short Text with a Few Seed Words

12Citations
Citations of this article
24Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Filtering out irrelevant documents and classifying the relevant ones into topical categories is a de facto task in many applications. However, supervised learning solutions require extravagant human efforts on document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest, and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudodocuments, one for each word. SSCF infers two kinds of topics on pseudo-documents: category-topics and general-topics. Each category-topic is associated with one category of interest, covering the meaning of the latter. In SSCF, we devise a novel word relevance estimation process based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that our proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even achieve superior performance than the supervised classifiers supervised latent dirichlet allocation (sLDA) and support vector machine (SVM) on some testing tasks.

Cite

CITATION STYLE

APA

Li, C., Chen, S., & Qi, Y. (2019). Filtering and Classifying Relevant Short Text with a Few Seed Words. Data and Information Management, 3(3), 165–186. https://doi.org/10.2478/dim-2019-0011

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free