Efficient clustering of e-mails by applying supervised machine learning algorithms

D. Quirumbay Yagual; B. Soria Méndez; V. Cruz Ruiz

Journal Article · Open Access

Journal of Applied Research and Technology (2024) 22(4) 560-566

DOI: 10.22201/icat.24486736e.2024.22.4.2383

Abstract

In today's digital age, effective detection of unwanted e-mails, commonly known as "spam", has become a priority for individuals and organizations. As inboxes fill up with unsolicited messages, it has become evident that the predefined rules and heuristics used by traditional spam filters have lost their effectiveness. The problem poses challenges at both the personal and business level: despite efforts to protect e-mail accounts with antivirus software, which in many cases comes at a cost, spam remains a growing concern, and for businesses, implementing costly firewalls can be an unnecessary burden. Its impact on the efficiency and security of e-mail communication is indisputable.

The primary objective of this paper is to investigate and evaluate machine learning algorithms for automatic spam detection, using text classification techniques applied to mail servers and personal computers. Three algorithms are examined, Random Forest, Decision Tree and Naive Bayes, with the intention of determining their applicability in both environments. The study relies on two research methodologies. First, feature selection identifies the most relevant variables in mail classification, including keywords and word frequencies. Second, performance evaluation uses metrics such as accuracy, recall and F1-score to understand how well the models detect spam and legitimate e-mails.

The results are presented as comparative tables showing the hit and miss rates of the three models evaluated. Notably, the Random Forest model, when applied in conjunction with tokenization techniques, is found to be more efficient than the other two models. Choosing the right machine learning model is critical to ensure efficiency in e-mail classification, and this study provides a solid basis for making informed decisions when implementing e-mail security systems in real-world business environments. Spam detection supported by machine learning algorithms remains an evolving field and offers a promising solution to a persistent problem in the digital world.
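The evaluation metrics named in the abstract can be illustrated with a minimal sketch. The function name `evaluate`, the label strings and the sample predictions below are hypothetical and are not taken from the paper; the sketch only shows how accuracy, precision, recall and F1-score follow from the counts of hits and misses for the spam class.

```python
from collections import Counter


def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1 for the positive class."""
    # Count how often each (true label, predicted label) pair occurs.
    pairs = Counter(zip(y_true, y_pred))
    tp = sum(n for (t, p), n in pairs.items() if t == positive and p == positive)
    fp = sum(n for (t, p), n in pairs.items() if t != positive and p == positive)
    fn = sum(n for (t, p), n in pairs.items() if t == positive and p != positive)
    tn = sum(n for (t, p), n in pairs.items() if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight e-mails (illustration only, not study data).
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

In this toy case three spam messages are caught, one is missed and one legitimate message is flagged, so all four metrics come out at 0.75. On real, imbalanced mail traffic the four values usually differ, which is why recall and F1-score are reported alongside accuracy.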

Citation (APA)

Quirumbay Yagual, D., Soria Méndez, B., & Cruz Ruiz, V. (2024). Efficient clustering of e-mails by applying supervised machine learning algorithms. Journal of Applied Research and Technology, 22(4), 560–566. https://doi.org/10.22201/icat.24486736e.2024.22.4.2383