Zing Forum


PhishGuard: An XGBoost-Based Real-Time Phishing URL Detection System

An end-to-end machine learning system that achieves 98.28% accuracy in phishing website detection by analyzing URL structural features. It integrates a browser extension for real-time protection and includes a complete model monitoring and automatic retraining mechanism.

Tags: phishing detection, XGBoost, cybersecurity, machine learning, browser extension, URL analysis, real-time protection, MLOps
Published 2026-05-16 23:15 | Recent activity 2026-05-16 23:19 | Estimated read 5 min

Section 01

PhishGuard: Core Overview of an XGBoost-Based Real-Time Phishing URL Detection System

PhishGuard is an end-to-end machine learning system that uses XGBoost to detect phishing URLs in real time, achieving 98.28% accuracy. It analyzes only the structural features of a URL, so it never needs the webpage content, which keeps detection fast and privacy-friendly. The system integrates a browser extension for real-time protection and includes full model monitoring and automatic retraining. Keywords: phishing detection, XGBoost, network security, browser extension, URL analysis, MLOps.


Section 02

Background: Phishing Threats & Limitations of Traditional Defenses

Phishing is one of the most prevalent and damaging network security threats: over 90% of network intrusions reportedly start with a phishing email or site. Traditional blacklist-based defenses are reactive. They cannot identify phishing sites that have not yet been reported, and attackers can bypass them simply by switching domains. As a result, proactive ML-based detection has become an industry focus.


Section 03

Method: Feature Engineering from URL Structure

PhishGuard extracts 30 numerical features from each URL, grouped into 7 categories. The five described here are:

  1. URL components: Protocol (FTP appears almost exclusively in phishing samples) and the parts of the domain
  2. Length: Phishing URLs tend to have longer paths and query strings (domain length is not a key distinguishing feature)
  3. Domain: Phishing URLs favor low-cost TLDs (.top/.icu/.dev/.app) and SLDs containing more hyphens and digits
  4. Entropy: Phishing URLs show higher path entropy (reflecting obfuscation techniques)
  5. Character-level: Counts of dots, hyphens, digits, special characters, etc.
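
The feature categories above could be computed along the following lines. This is only a minimal sketch, not PhishGuard's actual extractor: the function names, the feature names, and the choice of Shannon entropy in bits per character are assumptions made for illustration, and only a handful of the 30 features are shown. The TLD set is the one named in the list above.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# TLDs the article names as common in phishing samples.
LOW_COST_TLDS = {"top", "icu", "dev", "app"}


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Return a small, illustrative subset of URL structural features.

    Feature names are hypothetical; the real system uses 30 features.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    tld = labels[-1] if len(labels) > 1 else ""
    return {
        # URL components
        "is_https": int(parts.scheme == "https"),
        "is_ftp": int(parts.scheme == "ftp"),
        # Length
        "path_length": len(parts.path),
        "query_length": len(parts.query),
        # Domain
        "tld_is_low_cost": int(tld in LOW_COST_TLDS),
        "domain_hyphen_count": host.count("-"),
        "domain_digit_count": sum(c.isdigit() for c in host),
        # Entropy
        "domain_entropy": shannon_entropy(host),
        "path_entropy": shannon_entropy(parts.path),
        # Character-level
        "domain_dot_count": host.count("."),
        "special_char_count": sum(not c.isalnum() for c in url),
    }


features = extract_features("https://secure-login.example.top/a/b?id=1")
```

Because every value is derived from the URL string alone, the extractor never has to fetch the page, which is what makes the approach fast and privacy-friendly.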

Section 04

Method: Model Training & Optimization

Training pipeline:

  1. Data preprocessing: Clean invalid URLs, expand shortened links, and remove more than 33k duplicate records, leaving a balanced training set of 200k URLs
  2. Model selection: XGBoost outperformed random forest
  3. Hyperparameter tuning: Bayesian optimization raised recall to 97.25%
  4. Threshold adjustment: A decision threshold of 0.45 balances recall (97.34%) and precision (99.04%)
  5. Feature selection: SHAP analysis identified the 8 most impactful features (HTTPS presence, domain dot count, domain entropy, etc.), and 4 low-impact features were removed
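
To illustrate why moving the decision threshold trades precision against recall, here is a small sketch in plain Python. The helper name, the toy scores and labels, and the use of `>=` for the comparison are assumptions for illustration; they are not PhishGuard's data or code, and the numbers do not reproduce the reported 97.34% and 99.04%.

```python
def precision_recall_at(threshold, probs, labels):
    """Precision and recall when scores >= threshold are flagged as phishing.

    probs are predicted phishing probabilities; labels use 1 for phishing
    and 0 for legitimate.
    """
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        flagged = p >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented toy data: four phishing URLs and two legitimate ones.
probs = [0.95, 0.80, 0.48, 0.30, 0.47, 0.10]
labels = [1, 1, 1, 1, 0, 0]

default_pr = precision_recall_at(0.50, probs, labels)  # (1.0, 0.5)
lowered_pr = precision_recall_at(0.45, probs, labels)  # (0.75, 0.75)
```

On this toy data, lowering the threshold from 0.50 to 0.45 catches one more phishing URL (recall rises from 0.5 to 0.75) at the cost of one false alarm (precision falls from 1.0 to 0.75). Choosing a threshold is the same trade-off evaluated on a held-out validation set.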


Section 05

System Architecture & Deployment

Microservice architecture:

  1. FastAPI inference service: Asynchronous and Dockerized for high performance
  2. Browser extension: Intercepts URLs in real time, calls the API, and warns about or blocks suspicious pages
  3. Monitoring: MLflow (model versioning/tracking), Azure Blob (model storage), MongoDB (prediction logs/user feedback)
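
The request flow between the extension and the inference service might look roughly like the sketch below. It is framework-free so the logic is visible on its own; in the real system this would sit behind a FastAPI route. The function name `handle_check`, the response fields, and the choice to allow (rather than block) unparseable input are all hypothetical. The 0.45 threshold is the one reported in Section 04.

```python
from datetime import datetime, timezone
from typing import Callable

THRESHOLD = 0.45  # decision threshold reported in the article


def handle_check(
    url: str,
    score_fn: Callable[[str], float],
    threshold: float = THRESHOLD,
) -> dict:
    """Score one URL and build the response the extension would act on.

    score_fn stands in for the trained model: it maps a URL to a
    phishing probability in [0, 1]. All field names are hypothetical.
    """
    # Assumption: anything that is not an http/https/ftp URL is not scored.
    if not url.lower().startswith(("http://", "https://", "ftp://")):
        return {"url": url, "score": None, "verdict": "invalid", "action": "allow"}

    score = float(score_fn(url))
    is_phishing = score >= threshold
    return {
        "url": url,
        "score": score,
        "verdict": "phishing" if is_phishing else "legitimate",
        "action": "block" if is_phishing else "allow",
        # A timestamp makes the record usable as a prediction log entry.
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }


# Usage with a stand-in scorer instead of the real model.
response = handle_check("https://example.top/login", lambda u: 0.91)
```

The same dictionary could double as the prediction log entry written to MongoDB, with user feedback attached to it later.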

Section 06

Continuous Learning: Closed-Loop Feedback Mechanism

PhishGuard's adaptive design:

  1. Data drift monitoring: Azure Functions compute the Population Stability Index (PSI) over a 15-day rolling window on user feedback data stored in MongoDB
  2. Auto retraining: GitHub Actions trigger retraining when drift is detected or enough new samples have accumulated; the new model replaces the production model only if its performance improves
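
The drift check and the retrain/promote gates described above can be sketched as follows. PSI is computed with its standard definition, the sum over bins of (actual share - expected share) * ln(actual share / expected share). Everything else is an assumption for illustration: the function names, the fixed bin edges, the 0.2 PSI cutoff (a common rule of thumb, not a value from the article), and the 5000-sample minimum.

```python
import math
from bisect import bisect_right


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index of one feature over fixed bin edges.

    expected is the reference (training-time) sample, actual is the
    recent window. Empty bins are floored at eps to keep the log finite.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(psi_value, new_samples, psi_cutoff=0.2, min_new_samples=5000):
    """Retrain on detected drift OR on enough newly accumulated samples."""
    return psi_value > psi_cutoff or new_samples >= min_new_samples


def should_promote(candidate_metric, production_metric):
    """Promote only if the candidate strictly beats the production model."""
    return candidate_metric > production_metric


# Toy URL-length samples: the recent window has shifted to longer URLs.
reference = [10, 20, 30, 40]
recent = [30, 40, 50, 60]
drift = psi(reference, recent, edges=[25])
```

Here every recent value falls in the upper bin while the reference sample is split evenly, so `drift` is far above the 0.2 cutoff and `should_retrain(drift, new_samples=100)` returns `True`. The strict comparison in `should_promote` means a retrained model that merely ties the current one is not deployed.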

Section 07

Application Value & Conclusion

Application value: Detection runs in real time because no webpage content has to be loaded, and it is privacy-friendly because page content is never accessed. The project also serves as a reference for developers who want to move ML models from experiment to production. Conclusion: PhishGuard combines cybersecurity and ML into an effective tool against evolving phishing attacks, and its monitoring and retraining loop lets it adapt as those attacks change.