Power law for text categorization

Wuying Liu; Lin Wang; Mianzhu Yi

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013) 8202 LNAI 131-143

DOI: 10.1007/978-3-642-41491-6_13

Abstract

Text categorization (TC) is a challenging problem, and the corresponding algorithms can be used in many applications. This paper addresses the online multi-category TC problem, abstracted from applications of online binary TC and batch multi-category TC. Most applications are concerned with the space-time performance of TC algorithms. By investigating the token frequency distribution in an email collection and a Chinese web document collection, this paper re-examines the power law and proposes a random sampling ensemble Bayesian (RSEB) TC algorithm. Supported by a token-level memory that stores labeled documents, the RSEB algorithm uses a text retrieval approach to solve text categorization problems. The experimental results show that the RSEB algorithm achieves state-of-the-art performance with greatly reduced space-time requirements in both the TREC email spam filtering task and the Chinese web document classification task. © Springer-Verlag 2013.
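
The abstract does not say how the token frequency distribution was examined. As a rough illustration only, and not the authors' procedure, the sketch below shows one common way to check whether token counts follow a power law: sort the counts by rank and estimate the slope of log frequency against log rank with ordinary least squares. The function names `rank_frequency` and `loglog_slope` and the synthetic corpus are assumptions made for this example.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return token counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks; a slope near -1 is the classic Zipf-like case.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic corpus (an assumption for illustration): the token at rank r
# occurs roughly 1000 / r times, so the fitted slope should be close to -1.
tokens = []
for rank in range(1, 51):
    tokens.extend([f"tok{rank}"] * round(1000 / rank))

slope = loglog_slope(rank_frequency(tokens))
print(f"estimated log-log slope: {slope:.2f}")
```

On real collections the fit is usually restricted to a range of ranks, since the head and the long tail of singleton tokens tend to deviate from a straight line in log-log space.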

Cite

Liu, W., Wang, L., & Yi, M. (2013). Power law for text categorization. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8202 LNAI, pp. 131–143). https://doi.org/10.1007/978-3-642-41491-6_13
