N-Gram-Based Text Categorization

William B Cavnar; John M Trenkle (Ann Arbor, MI)

Conference Paper

In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (1994) 161-175

Abstract

Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems. We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages. The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80% correct classification...

Cite

Cavnar, W. B., & Trenkle, J. M. (1994). N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 161–175. Retrieved from http://www.let.rug.nl/~vannoord/TextCat/textcat.pdf
