A Lightweight Machine Learning-Based Email Spam Detection Model Using Word Frequency Pattern

https://doi.org/10.48185/jitc.v4i1.653

Authors

  • Mohamed Aly Bouke, University Putra Malaysia
  • Azizol Abdullah, University Putra Malaysia
  • Mohd Taufik Abdullah, University Putra Malaysia
  • Saleh Ali Zaid, University Putra Malaysia
  • Hayate El Atigh, Bandirma Onyedi Eylul University
  • Sameer Hamoud ALshatebi

Keywords:

Machine Learning (ML), Spam Detection, Random Forest

Abstract

Spam emails have become a severe challenge that irritates recipients and consumes their time. Existing spam detection techniques often have low detection rates and cannot tolerate high-dimensional data. Moreover, because machine learning algorithms are effective at identifying mail as solicited or unsolicited, they have become common in spam detection systems. This paper proposes a lightweight machine learning-based spam detection model based on the Random Forest (RF) algorithm. According to the empirical results, the proposed model achieved 97% accuracy on the Spambase dataset. The performance of the proposed model was evaluated using standard classification metrics: F-score, Recall, Precision, and Accuracy. A comparison of our model with the state-of-the-art works investigated in this paper showed that the model performs better, with an improvement of 6% for all metrics.
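
The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `classification_metrics`, the `Metrics` container, and the toy labels are all hypothetical, and the printed values describe only the toy data, not the paper's results. The sketch assumes a binary task in which spam is the positive class.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f_score: float


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F-score (F1) for a binary task.

    Hypothetical helper for illustration; `positive` marks the spam label.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with spam as the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or actual spam).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return Metrics(accuracy, precision, recall, f_score)


# Toy labels (1 = spam, 0 = legitimate), invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
m = classification_metrics(y_true, y_pred)
print(m)
```

On the toy labels there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, so accuracy is 0.8 while precision, recall, and F-score are each 0.75. Reporting all four matters because accuracy alone can look high when one class is much more frequent than the other.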

Published

2023-06-27

How to Cite

Bouke, M. A., Abdullah, A., Abdullah, M. T., Zaid, S. A., El Atigh, H., & ALshatebi, S. H. (2023). A Lightweight Machine Learning-Based Email Spam Detection Model Using Word Frequency Pattern. Journal of Information Technology and Computing, 4(1), 15–28. https://doi.org/10.48185/jitc.v4i1.653

Section

Articles