Spam detection using character n-grams

Ioannis Kanaris; Konstantinos Kanaris; Efstathios Stamatatos

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2006) 3955 LNAI 95-104

DOI: 10.1007/11752912_12

Abstract

This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse data problem that arises with word-level n-grams. Moreover, it is language-independent and does not require a lemmatizer or any 'deep' text preprocessing. Based on experiments on the Ling-Spam corpus, we evaluate the proposed representation in combination with support vector machines. Both the binary and the term-frequency representations achieve high precision while maintaining recall at an equally high level, which is a crucial factor for anti-spam filters, a cost-sensitive application. © Springer-Verlag Berlin Heidelberg 2006.
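
As a rough illustration of the 'bag of character n-grams' idea described in the abstract, the following minimal Python sketch builds binary and term-frequency features from raw text. The function names, the use of raw counts as the term-frequency weight, and the choice of n are assumptions made for this example; the abstract does not specify these details, and the support vector machine step is not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping character n-gram of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_features(text, n):
    """Term-frequency features: n-gram -> raw count (assumed weighting)."""
    return dict(Counter(char_ngrams(text, n)))


def binary_features(text, n):
    """Binary features: n-gram -> 1 if it occurs at least once."""
    return {gram: 1 for gram in set(char_ngrams(text, n))}


if __name__ == "__main__":
    message = "free free money"  # hypothetical message text
    print(tf_features(message, 4))
    print(binary_features(message, 4))
```

No tokenizer or lemmatizer is involved: spaces and punctuation are treated like any other character, which is what makes such a representation independent of the language of the message. Feature dictionaries of this kind could then be mapped onto a fixed vocabulary and passed to a classifier.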

Citation (APA)

Kanaris, I., Kanaris, K., & Stamatatos, E. (2006). Spam detection using character n-grams. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3955 LNAI, pp. 95–104). Springer Verlag. https://doi.org/10.1007/11752912_12
