Centroid-based categorization is one of the most popular algorithms in text classification. Normalization is an important factor in improving the performance of a centroid-based classifier when documents in a text collection vary considerably in size. In the past, normalization involved only document- or class-length normalization. In this paper, we propose a new type of normalization, called term-length normalization, which considers the distribution of a term within a class. The performance of this normalization is investigated in three settings of a standard centroid-based (TFIDF) classifier: (1) without class-length normalization, (2) with cosine class-length normalization, and (3) with summed-weight normalization. The results suggest that our term-length normalization improves classification accuracy in all cases.
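To make the baseline concrete, the following is a minimal sketch of a standard centroid-based TFIDF classifier with cosine class-length normalization (setting 2 above). It is not the authors' implementation: the toy corpus, function names, and weighting details are illustrative assumptions, and the proposed term-length normalization itself is not reproduced here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training corpus (label, text) -- for illustration only.
docs = [
    ("sports", "goal match team score win"),
    ("sports", "team player match goal"),
    ("finance", "stock market price trade"),
    ("finance", "market price bank trade profit"),
]

def tokenize(text):
    return text.split()

# Inverse document frequency from the training corpus.
N = len(docs)
df = Counter()
for _, text in docs:
    df.update(set(tokenize(text)))
idf = {t: math.log(N / df[t]) for t in df}

def tfidf(text):
    # Unseen terms get zero weight (idf defaults to 0).
    tf = Counter(tokenize(text))
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine_normalize(vec):
    # Scale the vector to unit Euclidean length (cosine normalization).
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

# Class centroid: average of the class's normalized document vectors,
# then cosine-normalized again (class-length normalization).
by_class = defaultdict(list)
for label, text in docs:
    by_class[label].append(cosine_normalize(tfidf(text)))
centroids = {}
for label, vecs in by_class.items():
    acc = defaultdict(float)
    for v in vecs:
        for t, w in v.items():
            acc[t] += w / len(vecs)
    centroids[label] = cosine_normalize(dict(acc))

def classify(text):
    # Assign the class whose centroid has the highest cosine similarity.
    v = cosine_normalize(tfidf(text))
    return max(centroids,
               key=lambda c: sum(w * centroids[c].get(t, 0.0)
                                 for t, w in v.items()))
```

A test document such as `classify("goal team match")` is scored against each class centroid by dot product of unit vectors; the paper's three settings differ only in whether and how the centroid vector is normalized in the last step.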
CITATION STYLE
Lertnattee, V., & Theeramunkong, T. (2003). Term-length normalization for centroid-based text categorization. In Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science) (Vol. 2773 PART 1, pp. 850–856). Springer Verlag. https://doi.org/10.1007/978-3-540-45224-9_113