# PhishGuard: An XGBoost-Based Real-Time Phishing URL Detection System

> An end-to-end machine learning system that achieves 98.28% accuracy in phishing website detection by analyzing URL structural features. It integrates a browser extension for real-time protection and includes a complete model monitoring and automatic retraining mechanism.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-16T15:15:41.000Z
- Last activity: 2026-05-16T15:19:59.170Z
- Popularity: 150.9
- Keywords: phishing detection, XGBoost, cybersecurity, machine learning, browser extension, URL analysis, real-time protection, MLOps
- Page link: https://www.zingnex.cn/en/forum/thread/phishguard-xgboost
- Canonical: https://www.zingnex.cn/forum/thread/phishguard-xgboost

---

## PhishGuard: Overview of an XGBoost-Based Real-Time Phishing URL Detection System

PhishGuard is an end-to-end machine learning system that uses XGBoost to detect phishing URLs in real time, with a reported accuracy of 98.28%. It analyzes only the structural features of a URL and never needs the webpage content, which keeps detection fast and privacy-friendly. The system integrates a browser extension for real-time protection and includes model monitoring and automatic retraining.

## Background: Phishing Threats & Limitations of Traditional Defenses

Phishing is one of the most prevalent and damaging cybersecurity threats: over 90% of network intrusions reportedly start with a phishing email or site. Traditional blacklist-based defenses are reactive. They cannot identify newly created phishing sites, and attackers can bypass them simply by switching domains. As a result, proactive ML-based detection has become an industry focus.

## Method: Feature Engineering from URL Structure

PhishGuard extracts 30 numerical features from each URL, grouped into 7 categories, including:
1. URL components: protocol (FTP appears almost exclusively in phishing samples) and domain parts
2. Length: phishing URLs tend to have longer paths and query strings; domain length is not a key distinguishing feature
3. Domain: phishing URLs favor low-cost TLDs (.top, .icu, .dev, .app) and second-level domains (SLDs) with more hyphens and digits
4. Entropy: phishing URLs show higher path entropy, reflecting obfuscation techniques
5. Character-level: counts of dots, hyphens, digits, special characters, and similar
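A few of the feature categories above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the project's actual extractor: the function names, the feature names, and the naive way the SLD is picked are all hypothetical, and only a dozen of the 30 features are shown.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# TLDs named in the post as common in phishing samples.
LOW_COST_TLDS = {"top", "icu", "dev", "app"}


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Compute a handful of illustrative structural features (hypothetical names)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    # Naive SLD guess: ignores multi-part suffixes such as "co.uk".
    sld = labels[-2] if len(labels) >= 2 else ""
    return {
        "is_https": int(parts.scheme == "https"),
        "is_ftp": int(parts.scheme == "ftp"),
        "path_length": len(parts.path),
        "query_length": len(parts.query),
        "tld_is_low_cost": int(tld in LOW_COST_TLDS),
        "sld_hyphens": sld.count("-"),
        "sld_digits": sum(ch.isdigit() for ch in sld),
        "domain_dots": host.count("."),
        "domain_entropy": shannon_entropy(host),
        "path_entropy": shannon_entropy(parts.path),
        "url_digits": sum(ch.isdigit() for ch in url),
        "url_special_chars": sum(not ch.isalnum() for ch in url),
    }


if __name__ == "__main__":
    for name, value in extract_features("https://pay-pal1.top/login?id=7").items():
        print(f"{name}: {value}")
```

Every value is numeric, so the resulting dictionary can be turned into a feature row for a gradient-boosted model without further encoding. A production extractor would likely need a public-suffix list to split domains correctly.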

## Method: Model Training & Optimization

The training pipeline has five stages:
1. Data preprocessing: invalid URLs are cleaned, short links are expanded, and more than 33,000 duplicate records are removed, leaving a final balanced training set of 200k records
2. Model selection: XGBoost outperforms random forest
3. Hyperparameter tuning: Bayesian optimization raises recall to 97.25%
4. Threshold adjustment: a decision threshold of 0.45 balances recall (97.34%) and precision (99.04%)
5. SHAP analysis: identifies the 8 most impactful features (HTTPS presence, domain dot count, domain entropy, etc.), and 4 low-impact features are removed
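The effect of the threshold adjustment can be illustrated with a small, dependency-free sketch. It assumes that a URL is flagged as phishing when the model's predicted probability is at or above the threshold; the labels and scores below are toy values invented for illustration, not the project's data.

```python
from typing import Sequence, Tuple


def precision_recall_at(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (precision, recall) when scores >= threshold are flagged as phishing (label 1)."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 1 = phishing, 0 = legitimate (hypothetical values).
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0]
SCORES = [0.95, 0.80, 0.47, 0.20, 0.46, 0.30, 0.10, 0.05]

if __name__ == "__main__":
    for threshold in (0.50, 0.45):
        p, r = precision_recall_at(Y_TRUE, SCORES, threshold)
        print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy set, lowering the threshold from 0.50 to 0.45 catches one more phishing URL at the cost of one false positive. That is the same trade-off the post describes, although the real precision and recall figures come from the project's own evaluation.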

## System Architecture & Deployment

The system follows a microservice architecture:
1. Inference service: an asynchronous FastAPI application, Dockerized for high performance
2. Browser extension: intercepts URLs in real time, calls the API, and warns or blocks the user
3. Monitoring: MLflow (model versioning and tracking), Azure Blob Storage (model storage), and MongoDB (prediction logs and user feedback)
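The request flow through these components might look roughly like the framework-agnostic sketch below. It is an assumption-laden illustration: `check_url`, `Verdict`, and the field names are hypothetical, the scoring function stands in for the trained XGBoost model, and a plain list stands in for the MongoDB prediction log. In the described system, logic like this would sit behind a FastAPI route that the browser extension calls.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class Verdict:
    """Hypothetical response shape returned to the browser extension."""
    url: str
    phishing_probability: float
    is_phishing: bool
    action: str  # "block" or "allow"
    checked_at: str


def check_url(
    url: str,
    score_fn: Callable[[str], float],
    log_sink: List[Dict],
    threshold: float = 0.45,  # decision threshold reported in the post
) -> Verdict:
    """Score a URL, decide whether to block it, and record the prediction."""
    probability = score_fn(url)
    is_phishing = probability >= threshold
    verdict = Verdict(
        url=url,
        phishing_probability=probability,
        is_phishing=is_phishing,
        action="block" if is_phishing else "allow",
        checked_at=datetime.now(timezone.utc).isoformat(),
    )
    # Stand-in for writing the prediction log to a database.
    log_sink.append(asdict(verdict))
    return verdict


def fake_score(url: str) -> float:
    """Placeholder scorer for the demo; NOT a real model."""
    return 0.9 if "login" in url else 0.1


if __name__ == "__main__":
    logs: List[Dict] = []
    print(check_url("http://pay-pal1.top/login", fake_score, logs))
    print(check_url("https://example.com/", fake_score, logs))
    print(f"{len(logs)} predictions logged")
```

Keeping the decision logic in a plain function like this makes it easy to unit-test separately from the web framework and the storage backend.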

## Continuous Learning: Closed-Loop Feedback Mechanism

PhishGuard adapts to new attack patterns through two mechanisms:
1. Data drift monitoring: Azure Functions compute the Population Stability Index (PSI) over a 15-day rolling window of user feedback data stored in MongoDB
2. Auto retraining: GitHub Actions trigger retraining when drift is detected or enough new samples have accumulated; the new model replaces the production model only if its performance improves
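The drift check and the retraining gate can be sketched as follows. PSI is computed with its usual definition, the sum over bins of (actual − expected) × ln(actual / expected), where both terms are bin proportions. The 0.2 drift threshold is a common rule of thumb and `min_samples` is a hypothetical value; the post does not state which thresholds or bins PhishGuard uses.

```python
import math
from typing import List, Sequence


def bin_proportions(
    values: Sequence[float], edges: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Share of values per bin [edges[i], edges[i+1]); the last bin includes its upper edge.

    Values outside the edges are ignored. Empty bins are floored at eps
    so the logarithm in PSI stays defined.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1] or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    total = max(sum(counts), 1)
    return [max(c / total, eps) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index between a baseline sample and a recent sample."""
    e = bin_proportions(expected, edges)
    a = bin_proportions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(
    psi_value: float,
    new_samples: int,
    psi_threshold: float = 0.2,  # assumed rule-of-thumb threshold
    min_samples: int = 5000,     # hypothetical sample count
) -> bool:
    """Retrain when drift is detected or enough new samples have accumulated."""
    return psi_value > psi_threshold or new_samples >= min_samples


def should_promote(candidate_metric: float, production_metric: float) -> bool:
    """Promote the new model only if it strictly improves on production."""
    return candidate_metric > production_metric


if __name__ == "__main__":
    baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    recent = [0.1, 0.6, 0.7, 0.8, 0.6, 0.7, 0.8, 0.9]
    drift = psi(baseline, recent, edges=[0.0, 0.5, 1.0])
    print(f"PSI={drift:.3f} retrain={should_retrain(drift, new_samples=120)}")
```

In practice PSI would be computed per feature over the rolling window, and the promotion metric would be whichever evaluation score the pipeline tracks, such as recall on a held-out set.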

## Application Value & Conclusion

PhishGuard detects phishing in real time without loading webpage content, which also makes it privacy-friendly because no page content is accessed. It also serves as a reference for developers who want to move ML models from experiment to production. In short, PhishGuard combines cybersecurity and machine learning into an effective tool against evolving phishing attacks, and its monitoring and retraining loop lets it stay relevant as those attacks change.
