Spam detection using character n-grams

Abstract

This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse data problem that arises with n-grams at the word level. Moreover, it is language-independent and does not require any lemmatizer or 'deep' text preprocessing. Based on experiments on the Ling-Spam corpus, we evaluate the proposed representation in combination with support vector machines. Both binary and term-frequency representations achieve high precision rates while maintaining recall at an equally high level, which is a crucial factor for anti-spam filters, a cost-sensitive application. © Springer-Verlag Berlin Heidelberg 2006.
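
To make the two representations named in the abstract concrete, the following is a minimal Python sketch of a 'bag of character n-grams' in binary and term-frequency form. It is an illustration only, not the authors' implementation: the function names, the choice of n = 3, and the use of raw counts as term frequency are assumptions, since the abstract does not specify them. The support vector machine step is omitted.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every overlapping character n-gram of `text`, in order.

    No tokenization or lemmatization is applied; spaces and punctuation
    are treated like any other character.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_representation(text, n=3):
    """Term-frequency form: map each n-gram to its raw count.

    Assumption: raw counts are used; the abstract does not say whether
    the counts are normalized.
    """
    return dict(Counter(char_ngrams(text, n)))


def binary_representation(text, n=3):
    """Binary form: map each n-gram that occurs at least once to 1."""
    return {gram: 1 for gram in set(char_ngrams(text, n))}


if __name__ == "__main__":
    message = "free free money"  # hypothetical message text
    print(char_ngrams(message))
    print(tf_representation(message))
    print(binary_representation(message))
```

Each dictionary can then be mapped onto a fixed n-gram vocabulary to obtain the feature vector passed to a classifier.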

Cite

Kanaris, I., Kanaris, K., & Stamatatos, E. (2006). Spam detection using character n-grams. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3955 LNAI, pp. 95–104). Springer Verlag. https://doi.org/10.1007/11752912_12
