Centroid-based categorization is one of the most popular algorithms in text classification. Normalization is an important factor in improving the performance of a centroid-based classifier when documents in a text collection vary considerably in size. In the past, normalization involved only document- or class-length normalization. In this paper, we propose a new type of normalization, called term-length normalization, which considers the distribution of a term within a class. The performance of this normalization is investigated in three settings of a standard centroid-based (TFIDF) classifier: (1) without class-length normalization, (2) with cosine class-length normalization, and (3) with summed-weight normalization. The results suggest that our term-length normalization improves classification accuracy in all cases.
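To make the baseline concrete, the following is a minimal sketch of a standard centroid-based TFIDF classifier with cosine class-length normalization (setting 2 above). It is not the authors' implementation: the toy corpus, function names, and weighting details are illustrative assumptions, and the proposed term-length normalization itself is not reproduced here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training corpus (label, text) -- for illustration only.
docs = [
    ("sports", "goal match team score win"),
    ("sports", "team player match goal"),
    ("finance", "stock market price trade"),
    ("finance", "market price bank trade profit"),
]

def tokenize(text):
    return text.split()

# Inverse document frequency from the training corpus.
N = len(docs)
df = Counter()
for _, text in docs:
    df.update(set(tokenize(text)))
idf = {t: math.log(N / df[t]) for t in df}

def tfidf(text):
    # Unseen terms get zero weight (idf defaults to 0).
    tf = Counter(tokenize(text))
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine_normalize(vec):
    # Scale the vector to unit Euclidean length (cosine normalization).
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

# Class centroid: average of the class's normalized document vectors,
# then cosine-normalized again (class-length normalization).
by_class = defaultdict(list)
for label, text in docs:
    by_class[label].append(cosine_normalize(tfidf(text)))
centroids = {}
for label, vecs in by_class.items():
    acc = defaultdict(float)
    for v in vecs:
        for t, w in v.items():
            acc[t] += w / len(vecs)
    centroids[label] = cosine_normalize(dict(acc))

def classify(text):
    # Assign the class whose centroid has the highest cosine similarity.
    v = cosine_normalize(tfidf(text))
    return max(centroids,
               key=lambda c: sum(w * centroids[c].get(t, 0.0)
                                 for t, w in v.items()))
```

A test document such as `classify("goal team match")` is scored against each class centroid by dot product of unit vectors; the paper's three settings differ only in whether and how the centroid vector is normalized in the last step.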
CITATION STYLE
Lertnattee, V., & Theeramunkong, T. (2003). Term-length normalization for centroid-based text categorization. In Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science) (Vol. 2773 PART 1, pp. 850–856). Springer Verlag. https://doi.org/10.1007/978-3-540-45224-9_113