Simultaneous learning of sentence clustering and class prediction for improved document classification

Abstract

In document classification it is common to represent a document in the so-called bag-of-words form, which is essentially a global term distribution indicating how often each term appears in a text. Ignoring the spatial statistics (i.e., where in the text the terms appear) can lead to a suboptimal solution. The key assumption in this paper is that the sentences of a document may have an underlying segmentation into groups, and that this partitioning may be intuitively appealing (e.g., each group corresponds to a particular sentiment or line of argument). If the segmentation were known, terms belonging to the same group could be treated alike for classification, and terms belonging to different groups could be treated differently. Based on this idea, we build a novel document classification model composed of two parts: a sentence tagger that predicts the group label of each sentence, and a classifier whose input features form a weighted term frequency vector, aggregated over all sentences but weighted cluster-wise according to the tagger's predictions. We suggest an efficient learning strategy for this model. On several benchmark document classification problems, we demonstrate that the proposed approach yields significantly improved classification performance over several existing algorithms.
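The cluster-weighted feature construction described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function name, the whitespace tokenization, the example sentences, the cluster labels, and the weight values are all hypothetical, and in the actual model both the sentence tagger and the weights would be learned rather than fixed by hand.

```python
from collections import Counter

def weighted_term_frequencies(sentences, cluster_of, cluster_weight):
    """Aggregate per-sentence term counts into one document vector,
    scaling each sentence's counts by the weight of its cluster.

    sentences      : list of sentence strings
    cluster_of     : cluster label per sentence (assumed tagger output)
    cluster_weight : mapping from cluster label to weight (assumed values)
    """
    features = Counter()
    for sentence, cluster in zip(sentences, cluster_of):
        weight = cluster_weight[cluster]
        # Plain whitespace tokenization, chosen only for illustration.
        for term, count in Counter(sentence.lower().split()).items():
            features[term] += weight * count
    return dict(features)

sentences = [
    "the plot is dull",
    "the acting is great",
    "great great ending",
]
cluster_of = [0, 1, 1]              # hypothetical sentence-tagger output
cluster_weight = {0: 0.5, 1: 2.0}   # hypothetical per-cluster weights

doc_vector = weighted_term_frequencies(sentences, cluster_of, cluster_weight)
```

With every cluster weight set to 1, the same function reduces to the ordinary bag-of-words count vector, which is the baseline representation the abstract contrasts against.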

Kim, M. (2017). Simultaneous learning of sentence clustering and class prediction for improved document classification. International Journal of Fuzzy Logic and Intelligent Systems, 17(1), 35–42. https://doi.org/10.5391/IJFIS.2017.17.1.35
