
PhishGuard: A Real-Time Phishing URL Detection System Based on XGBoost

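An end-to-end machine learning system that detects phishing websites with 98.28% accuracy by analyzing URL structural features, integrates a browser extension for real-time protection, and includes full model monitoring with automatic retraining.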

phishing detection, XGBoost, cybersecurity, machine learning, browser extension, URL analysis, real-time protection, MLOps

Published 2026/05/16 23:15 | Last activity 2026/05/16 23:19 | Estimated reading time: 5 minutes

Section 01

PhishGuard: Core Overview of XGBoost-Based Real-Time Phishing URL Detection System

PhishGuard is an end-to-end machine learning system that uses XGBoost to detect phishing URLs in real time, achieving 98.28% accuracy. It analyzes only the structural features of a URL, so it never needs the webpage content, which keeps detection both private and efficient. It integrates a browser extension for real-time protection and includes full model monitoring and automatic retraining. Keywords: phishing detection, XGBoost, cybersecurity, browser extension, URL analysis, MLOps.

Section 02

Background: Phishing Threats & Limitations of Traditional Defenses

Phishing is one of the most prevalent and harmful cybersecurity threats: over 90% of network intrusions start with phishing emails or sites. Traditional blacklist-based defenses are reactive. They cannot identify newly created phishing sites, and attackers can bypass them simply by switching domains. For this reason, proactive ML-based detection has become an industry focus.

Section 03

Method: Feature Engineering from URL Structure

PhishGuard extracts 30 numerical features from URLs, grouped into 7 categories, including:

URL components: protocol (FTP appears almost exclusively in phishing samples) and domain parts
Length: phishing URLs tend to have longer paths and query strings; domain length is not a key distinguishing feature
Domain: phishing URLs favor low-cost TLDs (.top, .icu, .dev, .app) and second-level domains (SLDs) with more hyphens and digits
Entropy: phishing URLs show higher path entropy, reflecting obfuscation techniques
Character-level: counts of dots, hyphens, digits, special characters, and similar
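
The feature categories above can be illustrated with a minimal sketch. It is not PhishGuard's actual extractor: the function names, the feature names, and the particular subset of features are assumptions made for illustration, and only the flagged TLDs come from the list above.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# TLDs named in the article as common in phishing samples.
FLAGGED_TLDS = {"top", "icu", "dev", "app"}


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict[str, float]:
    """Map a URL to a small, illustrative subset of structural features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    tld = labels[-1] if len(labels) > 1 else ""
    return {
        # URL components
        "is_https": float(parts.scheme == "https"),
        "is_ftp": float(parts.scheme == "ftp"),
        # Length
        "url_length": float(len(url)),
        "path_length": float(len(parts.path)),
        "query_length": float(len(parts.query)),
        # Domain and character-level counts
        "domain_dot_count": float(host.count(".")),
        "domain_hyphen_count": float(host.count("-")),
        "domain_digit_count": float(sum(c.isdigit() for c in host)),
        "tld_is_flagged": float(tld in FLAGGED_TLDS),
        # Entropy
        "domain_entropy": shannon_entropy(host),
        "path_entropy": shannon_entropy(parts.path),
    }


if __name__ == "__main__":
    for name, value in extract_features(
        "https://secure-login1.example.top/a/b?x=1"
    ).items():
        print(f"{name}: {value:.3f}")
```

Because every feature is computed from the URL string alone, a sketch like this never fetches the page, which is the property the article relies on for privacy and speed.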

Section 04

Method: Model Training & Optimization

Data preprocessing: clean invalid URLs, expand shortened links, and remove more than 33,000 duplicate records, yielding a final balanced training set of 200k URLs
Model selection: XGBoost outperforms random forest
Hyperparameter tuning: Bayesian optimization raises recall to 97.25%
Threshold adjustment: a decision threshold of 0.45 balances recall (97.34%) and precision (99.04%)
Feature selection: SHAP analysis identifies the 8 most impactful features (HTTPS presence, domain dot count, domain entropy, and others) and leads to the removal of 4 low-impact features
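
To make the threshold adjustment concrete, the sketch below compares precision and recall at the default cutoff of 0.5 and at the 0.45 threshold mentioned above. The labels and scores are made-up toy data, not PhishGuard results, and the function name and the `>=` comparison are assumptions.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold count as phishing."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_phishing = score >= threshold
        if predicted_phishing and label == 1:
            tp += 1
        elif predicted_phishing and label == 0:
            fp += 1
        elif not predicted_phishing and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): 1 = phishing, 0 = legitimate.
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0]
SCORES = [0.95, 0.80, 0.48, 0.30, 0.47, 0.20, 0.10, 0.05]

if __name__ == "__main__":
    for threshold in (0.50, 0.45):
        p, r = precision_recall(Y_TRUE, SCORES, threshold)
        print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold catches one more phishing URL (recall rises from 0.50 to 0.75) but also flags one legitimate URL (precision falls from 1.00 to 0.75). This is the general trade-off that a threshold choice has to balance.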

Section 05

System Architecture & Deployment

Microservice architecture:

FastAPI inference service: asynchronous and Dockerized for high performance
Browser extension: intercepts URLs in real time, calls the inference API, and warns about or blocks suspicious pages
Monitoring: MLflow (model versioning and tracking), Azure Blob (model storage), MongoDB (prediction logs and user feedback)
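
The flow between these components can be sketched in a framework-agnostic way: the extension sends a URL, the service scores it, returns a verdict, and records a log entry. The sketch below deliberately leaves out the FastAPI route and the MongoDB client. All names (`Verdict`, `check_url`, `to_log_document`) and the fields of the log document are hypothetical; only the 0.45 threshold comes from the article.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable

THRESHOLD = 0.45  # decision threshold reported in the article


@dataclass
class Verdict:
    """Response the extension would receive for one URL (hypothetical schema)."""
    url: str
    phishing_probability: float
    is_phishing: bool
    model_version: str
    checked_at: str


def check_url(
    url: str,
    score: Callable[[str], float],
    model_version: str = "unknown",
) -> Verdict:
    """Score a URL with the supplied model function and build a verdict."""
    probability = score(url)
    return Verdict(
        url=url,
        phishing_probability=probability,
        is_phishing=probability >= THRESHOLD,
        model_version=model_version,
        checked_at=datetime.now(timezone.utc).isoformat(),
    )


def to_log_document(verdict: Verdict) -> dict:
    """Build a prediction-log record; user feedback is filled in later."""
    document = asdict(verdict)
    document["user_feedback"] = None
    return document


if __name__ == "__main__":
    # Stand-in scorer; a real service would call the trained model here.
    verdict = check_url("http://example.test/login", lambda u: 0.9, "v1")
    print(to_log_document(verdict))
```

Passing the scorer in as a function keeps the decision logic testable without a loaded model, and storing the model version with each prediction makes later comparison between model versions possible.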

Section 06

Continuous Learning: Closed-Loop Feedback Mechanism

PhishGuard's adaptive design:

Data drift monitoring: Azure Functions compute the Population Stability Index (PSI) over a 15-day rolling window on user feedback data stored in MongoDB
Auto retraining: GitHub Actions trigger retraining when drift is detected or enough new samples have accumulated; the new model replaces the production model only if performance improves
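
The drift check and the retraining gate can be sketched as follows. PSI compares the binned distribution of a feature in a reference sample with that in a recent sample: PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the reference and recent bin proportions. The bin count, the 0.25 drift limit (a common rule of thumb, not a value from the article), the sample-count limit, and all function names are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` reference."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate reference: everything falls into bin 0

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clamp out-of-range values
            counts[index] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(count / len(values), eps) for count in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(psi_value, new_samples, psi_limit=0.25, min_new_samples=5000):
    """Trigger retraining on drift or on enough new samples (limits are assumed)."""
    return psi_value > psi_limit or new_samples >= min_new_samples


def should_promote(candidate_metric, production_metric):
    """Replace the production model only if the candidate is strictly better."""
    return candidate_metric > production_metric


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]
    recent = [min(v + 0.3, 0.99) for v in reference]  # shifted distribution
    value = psi(reference, recent)
    print(f"PSI={value:.3f} retrain={should_retrain(value, new_samples=0)}")
```

In the described system this check would run per feature over the 15-day window; the sketch shows a single feature for brevity.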

Section 07

Application Value & Conclusion

Application value: PhishGuard detects phishing in real time without loading webpage content, and it is privacy-friendly because it never accesses that content. It also serves as a reference for developers moving ML models from experiment to production. Conclusion: PhishGuard combines cybersecurity and machine learning into an effective tool against evolving phishing attacks, and its ability to retrain itself helps it stay relevant.