An Unsupervised Approach for Content-Based Clustering of Emails into Spam and Ham through Multiangular Feature Formulation


This article is free to access.

Abstract

The rapid growth of spam email attacks, and the inherent malicious dynamism of those attacks on a range of social, personal and business activities, warrants an intelligent and automated anti-spam framework. Attacks such as malware propagation, identity theft, pilfering of sensitive data, and monetary as well as reputational damage are increasing sharply, endangering the privacy of the victim. Current solutions are rather incomplete when the multidimensional feature range of email is taken into account. We believe a methodology based on Artificial Intelligence, especially unsupervised machine learning, is the way forward. This research investigates the application of unsupervised learning to the clustering of spam and ham emails. The overall goal is to develop a framework that depends solely on unsupervised methodologies, through a clustering approach that includes multiple algorithms and primarily uses the email content (body) and the subject header. The clustering was performed on a novel binary dataset of 22,000 ham and spam emails described by ten features (reduced from eleven after feature reduction). Seven of these ten features are unique to this study, engineered to represent impactful analytical email characteristics from a multiangular point of view. Of the five clustering algorithms investigated in this work, OPTICS produced the best clustering, demonstrating a 0.26% higher average efficacy than its nearest competitor, DBSCAN. The average balanced accuracy for OPTICS and DBSCAN was found to be ≈75.76%.
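The abstract reports balanced accuracy for a clustering result, which requires some way of comparing cluster assignments against the ham/spam labels. The sketch below shows one common way to do that, and it is an assumption rather than the authors' documented protocol: each cluster is mapped to the majority label of its members, and balanced accuracy is then computed as the mean of per-class recall. The function names and the toy data are hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, true_labels):
    """Give each cluster the majority true label of its members.

    This majority-vote mapping is an assumed scheme, not taken from the paper.
    """
    members = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, true_labels):
        members[cluster_id][label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}


def balanced_accuracy(true_labels, predicted_labels):
    """Mean of per-class recall, so ham and spam weigh equally
    even when one class is much larger than the other."""
    correct = Counter()
    total = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data (made up): 6 ham and 4 spam emails assigned to clusters 0 and 1.
true_labels = ["ham"] * 6 + ["spam"] * 4
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]
score = balanced_accuracy(true_labels, predicted)
print(mapping, round(score, 4))
```

In the toy data, ham recall is 5/6 and spam recall is 3/4, so the balanced accuracy is their mean. Density-based algorithms such as OPTICS and DBSCAN may also mark points as noise; how such points are scored is a separate design choice that this sketch does not address.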

Cite

Karim, A., Azam, S., Shanmugam, B., & Kannoorpatti, K. (2021). An Unsupervised Approach for Content-Based Clustering of Emails into Spam and Ham through Multiangular Feature Formulation. IEEE Access, 9, 135186–135209. https://doi.org/10.1109/ACCESS.2021.3116128
