Abstract
This study introduces a hybrid phishing detection framework that combines machine learning (ML) with heuristic rule-based techniques to provide accurate, scalable, and policy-compliant detection across a variety of phishing types. The proposed method uses diverse datasets, including URL patterns, email headers, and HTML content, organized in layers so that analysis remains possible even when some features are missing. Feature selection techniques, such as variance thresholding and Recursive Feature Elimination (RFE), are applied to improve learning efficiency and reduce noise. Several classifiers, including Random Forest (RF), XGBoost, Gradient Boosting (GB), and CatBoost, are trained on the optimized features, and their outputs are combined by voting to improve overall reliability. The system also includes a rule-based engine aligned with India’s national Email Policy, with heuristic checks for non-government domains, missing authentication (SPF/DKIM/DMARC), use of insecure protocols, foreign IPs, phishing URLs, and other threat indicators. Each rule is weighted and contributes to a composite suspicion score that is explainable and mapped to policy. These heuristic signals are used both directly and as features for the ML models, yielding a layered, interpretable design. The final phishing score balances the heuristic and ML predictions and is compared against an optimized threshold to classify an input as phishing or safe. Experimental results on benchmark datasets demonstrate that heuristic-guided feature selection, combined with hybrid data integration, significantly improves performance, with average accuracy exceeding 95% on real-world datasets. On URL datasets, CatBoost and XGBoost reached training accuracies of up to 100% and testing accuracies of 96.7% and 96.4%, respectively. For email header analysis, RF achieved the highest accuracy at 99.85%. The findings underscore the significance of feature engineering in developing scalable and reliable phishing detection systems.
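The scoring pipeline described in the abstract (weighted rules, a composite suspicion score, and a thresholded blend with the ML prediction) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule names, weights, the `.gov.in` suffix check, the blend factor `alpha`, the threshold of 0.5, and the email field names are all hypothetical, and the ML probability is passed in as a plain number rather than produced by a trained ensemble.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Email = Dict[str, Any]  # hypothetical flat record of parsed header fields


@dataclass(frozen=True)
class Rule:
    name: str
    weight: float                      # assumed weight, not taken from the paper
    check: Callable[[Email], bool]     # returns True when the rule fires


# Illustrative rules loosely following the checks named in the abstract.
RULES: List[Rule] = [
    Rule("non_government_domain", 0.30,
         lambda e: not str(e.get("sender_domain", "")).endswith(".gov.in")),
    Rule("missing_authentication", 0.30,
         lambda e: not (e.get("spf_pass") and e.get("dkim_pass") and e.get("dmarc_pass"))),
    Rule("insecure_protocol", 0.20,
         lambda e: any(str(u).startswith("http://") for u in e.get("urls", []))),
    Rule("foreign_ip", 0.20,
         lambda e: e.get("origin_country", "IN") != "IN"),
]


def heuristic_score(email: Email, rules: List[Rule] = RULES) -> Tuple[float, List[str]]:
    """Return a suspicion score in [0, 1] and the names of the rules that fired."""
    fired = [r for r in rules if r.check(email)]
    total_weight = sum(r.weight for r in rules)
    score = sum(r.weight for r in fired) / total_weight
    return score, [r.name for r in fired]


def hybrid_score(heuristic: float, ml_probability: float, alpha: float = 0.5) -> float:
    """Blend the heuristic score and the ML phishing probability (alpha is assumed)."""
    return alpha * heuristic + (1.0 - alpha) * ml_probability


def classify(email: Email, ml_probability: float,
             alpha: float = 0.5, threshold: float = 0.5) -> Dict[str, Any]:
    """Compare the blended score against a threshold and keep the fired rules."""
    h_score, fired = heuristic_score(email)
    final = hybrid_score(h_score, ml_probability, alpha)
    return {
        "label": "phishing" if final >= threshold else "safe",
        "score": final,
        "fired_rules": fired,  # the explainable part of the decision
    }
```

Returning the list of fired rules alongside the label is what makes the decision traceable to individual checks; in a fuller system the same rule outputs could also be appended to the feature vector given to the classifiers.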
Jadhav, A., & Chandre, P. (2025). A Hybrid Heuristic-Machine Learning Framework for Phishing Detection Using Multi-Domain Feature Analysis. Engineering, Technology and Applied Science Research, 15(5), 27219–27226. https://doi.org/10.48084/etasr.11548