Near-duplicate mail detection based on URL information for spam filtering

Chun Chao Yeh; Chia Hui Lin

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2006) 3961 LNCS 842-851

DOI: 10.1007/11919568_84

Abstract

Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayesian, SVM, and K-NN. Only a few works, however, use a strategy based on detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter possible spam tricks for evading detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. With thousands of real mails collected as test data, our experimental results show that the proposed strategy outperforms the others. Without considering compulsory misses, over 97% of near-duplicate mails can be detected correctly. © Springer-Verlag Berlin Heidelberg 2006.
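The abstract does not spell out the matching procedure, so the following is only a rough sketch of the general idea of comparing mails by their URL information, not the authors' scheme. It assumes that each mail is reduced to the set of host names found in its URLs and that two mails count as near duplicates when the Jaccard similarity of those sets reaches a threshold. The function names, the regular expression, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: matches http/https URLs up to whitespace or a closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def url_features(mail_body: str) -> set:
    """Return the set of host names of all URLs found in a mail body."""
    hosts = set()
    for url in URL_PATTERN.findall(mail_body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal): skip it.
            continue
        if host:
            hosts.add(host.lower())
    return hosts


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(mail_a: str, mail_b: str, threshold: float = 0.5) -> bool:
    """Assumed decision rule: URL host sets overlap by at least `threshold`."""
    return jaccard(url_features(mail_a), url_features(mail_b)) >= threshold


if __name__ == "__main__":
    first = "Buy now http://cheap-pills.example.com/offer?id=123 today"
    second = "Bvy n0w http://cheap-pills.example.com/offer?id=987 t0day!!"
    print(is_near_duplicate(first, second))
```

Comparing host names rather than full URLs is one possible design choice under these assumptions: it ignores per-recipient query strings and obfuscated body text, but it also treats mails without any URL as never matching.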

Cite

Yeh, C. C., & Lin, C. H. (2006). Near-duplicate mail detection based on URL information for spam filtering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3961 LNCS, pp. 842–851). Springer Verlag. https://doi.org/10.1007/11919568_84
