Spam filtering: How the dimensionality reduction affects the accuracy of Naive Bayes classifiers

Tiago A. Almeida; Jurandy Almeida; Akebo Yamakami

Journal Article, Open Access

Journal of Internet Services and Applications (2011) 1(3) 183-200

DOI: 10.1007/s13174-010-0014-7

Abstract

E-mail spam has become an increasingly important problem with a large economic impact on society. Fortunately, several approaches can automatically detect and remove most of these messages, and the best-known techniques are based on Bayesian decision theory. However, such probabilistic approaches often suffer from a well-known difficulty: the high dimensionality of the feature space. Many term-selection methods have been proposed to avoid the curse of dimensionality. Nevertheless, it is still unclear how the performance of Naive Bayes spam filters depends on the scheme applied to reduce the dimensionality of the feature space. In this paper, we study the performance of many term-selection techniques with several different models of Naive Bayes spam filters. Our experiments were carefully designed to ensure statistically sound results. Moreover, we analyze the measurements usually employed to evaluate the quality of spam filters. Finally, we also investigate the benefits of using the Matthews correlation coefficient as a measure of performance. © The Brazilian Computer Society 2010.
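
To illustrate why the Matthews correlation coefficient (MCC) mentioned in the abstract can be more informative than plain accuracy on imbalanced mail collections, here is a minimal sketch. It is not the authors' code: the function names `mcc` and `accuracy` and the message counts are hypothetical, spam is treated as the positive class, and returning 0.0 when the denominator vanishes is a common convention assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Assumption: spam is the positive class. When any marginal sum is zero
    the coefficient is undefined; 0.0 is returned by convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of messages classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical collection: 900 legitimate messages and 100 spam messages.
# A trivial filter that labels every message as legitimate:
trivial = dict(tp=0, tn=900, fp=0, fn=100)
# A filter that catches 80 spam messages and blocks 10 legitimate ones:
useful = dict(tp=80, tn=890, fp=10, fn=20)

for name, counts in (("trivial", trivial), ("useful", useful)):
    print(name, round(accuracy(**counts), 3), round(mcc(**counts), 3))
```

With these made-up counts the trivial filter still reaches 90% accuracy while its MCC is 0, which is the kind of discrepancy that motivates looking beyond accuracy when comparing spam filters.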

Citation (APA)

Almeida, T. A., Almeida, J., & Yamakami, A. (2011). Spam filtering: How the dimensionality reduction affects the accuracy of Naive Bayes classifiers. Journal of Internet Services and Applications, 1(3), 183–200. https://doi.org/10.1007/s13174-010-0014-7
