We propose an unsupervised method for detecting spam documents in a given set of documents, based on equivalence relations on strings. We give three measures for quantifying the alienness of substrings within the documents, i.e., how different a substring is from the others. A document is then classified as spam if it contains a substring that belongs to an equivalence class with a high degree of alienness. The proposed method is unsupervised, language independent, and scalable. Computational experiments conducted on data collected from Japanese web forums show that the method successfully discovers spam. © Springer-Verlag Berlin Heidelberg 2007.
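The classification rule described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not define the equivalence relation or the three alienness measures, so the sketch assumes a simplified relation (two substrings are equivalent when they occur in exactly the same set of documents) and a placeholder score (length of the longest class member times the number of additional documents it appears in). The names `substring_classes`, `alienness`, `detect_spam`, `min_len`, and `threshold` are all hypothetical, and the brute-force enumeration of substrings is suitable only for tiny inputs.

```python
from collections import defaultdict


def substring_classes(docs, min_len=3):
    """Group substrings into classes by the set of documents containing them.

    ASSUMPTION: this grouping is a simplified stand-in for the paper's
    equivalence relation on strings, which the abstract does not define.
    """
    occurs = defaultdict(set)  # substring -> indices of documents containing it
    for i, doc in enumerate(docs):
        for start in range(len(doc)):
            for end in range(start + min_len, len(doc) + 1):
                occurs[doc[start:end]].add(i)

    classes = defaultdict(list)  # frozenset of document indices -> substrings
    for sub, doc_ids in occurs.items():
        classes[frozenset(doc_ids)].append(sub)
    return classes


def alienness(doc_ids, members):
    """Placeholder score, NOT one of the paper's three measures.

    Long substrings shared verbatim by several documents score highly.
    """
    return (len(doc_ids) - 1) * max(len(s) for s in members)


def detect_spam(docs, threshold, min_len=3):
    """Flag every document containing a substring from a high-scoring class."""
    spam = set()
    for doc_ids, members in substring_classes(docs, min_len).items():
        if alienness(doc_ids, members) >= threshold:
            spam |= doc_ids
    return sorted(spam)


docs = [
    "buy cheap pills now at example",
    "hello, how was the weather",
    "buy cheap pills now at example!!",
    "nice post, thanks",
]
flagged = detect_spam(docs, threshold=20)
```

With these toy documents and the assumed threshold of 20, `flagged` contains the indices of the two near-duplicate posts. No labels are used at any point, which mirrors the unsupervised setting, and the sketch works on raw characters, so it needs no tokenizer or language-specific preprocessing.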
Narisawa, K., Bannai, H., Hatano, K., & Takeda, M. (2007). Unsupervised spam detection based on string alienness measures. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4755 LNAI, pp. 161–172). Springer Verlag. https://doi.org/10.1007/978-3-540-75488-6_16