Efficient Ensemble-based Phishing Website Classification Models using Feature Importance Attribute Selection and Hyper parameter Tuning Approaches
Keywords:
Phishing, Cyber Security, Classification Models, Hyper parameter Tuning
Abstract
The internet is now a common venue for business, scientific, and educational activities. However, malicious actors keep using a variety of attack techniques against its users. Among them are attackers who use phishing to target enterprise networks and the wider internet. The use of machine learning (ML) for classifying phishing attacks is an active research area in cyber security, because phishing detection is a typical intrusion identification task. ML techniques can be categorized as single or ensemble learners, and ensemble learners have been identified as more promising than single classifiers. ML-based detection models can be further improved through feature selection/dimensionality reduction and hyperparameter tuning. This study focuses on the classification of phishing websites using ensemble learning algorithms. Random Forest (RF) and Extra Trees ensembles were used for the phishing classification. The models built from these algorithms were optimized by applying feature importance attribute selection and hyperparameter tuning. The RF-based phishing classification model achieved 99.3% accuracy, 0.996 recall, 0.983 F1-score, 0.996 precision, and an AUC score of 1.000. Similarly, the Extra Trees-based model attained 99.1% accuracy, 0.990 recall, 0.981 F1-score, 0.990 precision, and an AUC score of 1.000. Thus, the RF-based model achieved slightly better classification results than the Extra Trees-based model. The study concluded that the attribute selection and hyperparameter tuning approaches employed are very promising.
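The workflow summarized in the abstract (ensemble classifier, feature importance attribute selection, hyperparameter tuning, then evaluation) can be sketched with scikit-learn roughly as follows. This is a minimal illustration and not the authors' code: the synthetic data from `make_classification`, the median importance threshold, the small parameter grid, and the helper name `tune_and_evaluate` are all assumptions made for the example, so the scores it prints will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: synthetic stand-in data with binary labels, used only so the
# sketch runs on its own. A real study would load a phishing dataset here.
X, y = make_classification(n_samples=600, n_features=30, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def tune_and_evaluate(ensemble_cls):
    """Hypothetical helper: select features by importance, tune, then score."""
    pipeline = Pipeline([
        # Keep features whose importance is at least the median importance
        # (the threshold is an illustrative choice).
        ("select", SelectFromModel(
            ensemble_cls(n_estimators=50, random_state=0),
            threshold="median")),
        ("clf", ensemble_cls(random_state=0)),
    ])
    # Assumption: a deliberately tiny grid to keep the example fast.
    param_grid = {
        "clf__n_estimators": [50, 100],
        "clf__max_depth": [None, 10],
    }
    search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)

    y_pred = search.predict(X_test)
    y_score = search.predict_proba(X_test)[:, 1]
    selector = search.best_estimator_.named_steps["select"]
    return {
        "best_params": search.best_params_,
        "n_selected": int(selector.get_support().sum()),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
    }


results = {cls.__name__: tune_and_evaluate(cls)
           for cls in (RandomForestClassifier, ExtraTreesClassifier)}
for name, scores in results.items():
    print(name, scores)
```

Placing the selector inside the pipeline means the feature importances are recomputed on each cross-validation training fold, so the held-out folds do not influence which attributes are kept.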
Copyright (c) 2023 Journal of Information Technology and Computing

This work is licensed under a Creative Commons Attribution 4.0 International License.
- Copyright and Licensing
For all articles published in SABA journals, copyright is retained by the authors. Articles are licensed under an open access Creative Commons CC BY 4.0 license, meaning that anyone may download and read the paper for free. In addition, the article may be reused and quoted provided that the original published version is cited. These conditions allow for maximum use and exposure of the work, while ensuring that the authors receive proper credit.