Near-duplicate mail detection based on URL information for spam filtering

Abstract

Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayes, SVM, and k-NN, but only a few works rely on detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. Using thousands of real mails collected as test data, our experimental results show that the proposed strategy outperforms the others. Excluding compulsory misses, over 97% of near-duplicate mails are detected correctly. © Springer-Verlag Berlin Heidelberg 2006.
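The abstract does not spell out the detection algorithm, so the following is only a minimal sketch of the general idea of comparing mails by their URL information, not the authors' scheme. It assumes (hypothetically) that each mail is reduced to the set of host-plus-path strings of the URLs in its body, that two mails are compared with Jaccard similarity, and that a threshold of 0.5 marks a near duplicate; the names `url_features`, `jaccard`, and `is_near_duplicate` are illustrative.

```python
import re
from urllib.parse import urlsplit

# Assumption: URLs are found with a simple pattern; real mail needs MIME/HTML parsing.
URL_RE = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def url_features(body):
    """Reduce a mail body to a set of normalized 'host/path' strings."""
    feats = set()
    for raw in URL_RE.findall(body):
        try:
            parts = urlsplit(raw.rstrip(".,;"))
        except ValueError:
            continue  # skip malformed URLs
        host = (parts.hostname or "").lower()
        if host:
            # Drop the query string so per-recipient tracking IDs do not matter.
            feats.add(host + parts.path.rstrip("/"))
    return frozenset(feats)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as dissimilar."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(body_a, body_b, threshold=0.5):
    """Hypothetical rule: mails are near duplicates if their URL sets overlap enough."""
    return jaccard(url_features(body_a), url_features(body_b)) >= threshold


if __name__ == "__main__":
    a = "Buy now http://Example.com/deal?id=123 and http://shop.example.net/x/"
    b = "Hello friend, visit http://example.com/deal?id=999 or http://shop.example.net/x"
    print(is_near_duplicate(a, b))
```

Ignoring the query string is one way to tolerate the small per-message variations that spammers add to defeat identical matching; the threshold would need tuning on real data.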

Cite

Yeh, C. C., & Lin, C. H. (2006). Near-duplicate mail detection based on URL information for spam filtering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3961 LNCS, pp. 842–851). Springer Verlag. https://doi.org/10.1007/11919568_84
