Zing Forum

PhishGuard: A Real-Time Phishing URL Detection System Based on XGBoost

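An end-to-end machine learning system that detects phishing websites with 98.28% accuracy by analyzing URL structural features. It integrates a browser extension for real-time protection and includes complete model monitoring and automatic retraining.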

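Tags: phishing detection, XGBoost, cybersecurity, machine learning, browser extension, URL analysis, real-time protection, MLOps
Published 2026/05/16 23:15 | Last activity 2026/05/16 23:19 | Estimated reading time: 5 minutes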

Section 01

PhishGuard: Core Overview of an XGBoost-Based Real-Time Phishing URL Detection System

PhishGuard is an end-to-end machine learning system that uses XGBoost to detect phishing URLs in real time, with a reported accuracy of 98.28%. It analyzes only the structural features of a URL and never needs the webpage content, which keeps detection both efficient and privacy-preserving. It also integrates a browser extension for real-time protection and includes full model monitoring and automatic retraining. Keywords: phishing detection, XGBoost, network security, browser extension, URL analysis, MLOps.

Section 02

Background: Phishing Threats & Limitations of Traditional Defenses

Phishing is one of the most prevalent and harmful cybersecurity threats: over 90% of network intrusions start with phishing emails or sites. Traditional blacklist-based defenses are reactive. They cannot identify newly created phishing sites, and attackers can bypass them simply by changing domains. As a result, ML-based proactive detection has become an industry focus.

Section 03

Method: Feature Engineering from URL Structure

PhishGuard extracts 30 numerical features from URLs, grouped into 7 categories (five are summarized here):

  1. URL components: protocol (FTP appears almost exclusively in phishing samples) and domain parts
  2. Length: phishing URLs tend to have longer paths and query strings; domain length is not a key distinguishing feature
  3. Domain: phishing URLs favor low-cost TLDs (.top, .icu, .dev, .app) and second-level domains (SLDs) with more hyphens and digits
  4. Entropy: phishing URLs show higher path entropy, reflecting obfuscation techniques
  5. Character-level: counts of dots, hyphens, digits, special characters, etc.
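
The feature categories above can be illustrated with a minimal Python sketch. This is not PhishGuard's actual extractor: the function names, the feature names, and the choice of which features to compute are assumptions made for illustration, and only a handful of the 30 features are shown. The low-cost TLD set is taken from the list above.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# TLDs named in the post as common in phishing samples.
LOW_COST_TLDS = {"top", "icu", "dev", "app"}


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Compute a few URL-structure features (hypothetical feature names)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    return {
        # URL components
        "is_https": int(parts.scheme == "https"),
        "is_ftp": int(parts.scheme == "ftp"),
        # Length
        "url_length": len(url),
        "path_length": len(parts.path),
        "query_length": len(parts.query),
        # Domain
        "tld_is_low_cost": int(bool(labels) and labels[-1] in LOW_COST_TLDS),
        "domain_hyphen_count": host.count("-"),
        "domain_digit_count": sum(c.isdigit() for c in host),
        # Entropy
        "domain_entropy": shannon_entropy(host),
        "path_entropy": shannon_entropy(parts.path),
        # Character-level
        "domain_dot_count": host.count("."),
    }


features = extract_features("https://secure-login.example.top/a/b?x=1")
```

Every value is numeric, so the resulting dictionary can be turned directly into one row of a training matrix.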

Section 04

Method: Model Training & Optimization

Data preprocessing: invalid URLs are cleaned, short links are expanded, and more than 33k duplicate records are removed, leaving a balanced training set of 200k URLs. Model selection: XGBoost outperforms random forest. Hyperparameter tuning: Bayesian optimization raises recall to 97.25%. Threshold adjustment: setting the decision threshold to 0.45 balances recall (97.34%) and precision (99.04%). Feature analysis: SHAP identifies the 8 most impactful features (presence of HTTPS, domain dot count, domain entropy, etc.) and leads to the removal of 4 low-impact features.
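
To make the threshold adjustment concrete, here is a small sketch of how precision and recall shift as the decision threshold moves. It assumes that a URL is flagged as phishing when the model's predicted probability is at or above the threshold, which is the usual convention but is not confirmed by the post. The scores and labels are toy values, not PhishGuard results.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when score >= threshold is flagged as phishing (label 1)."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy predicted probabilities and true labels (1 = phishing), for illustration only.
scores = [0.95, 0.80, 0.47, 0.30, 0.46, 0.10]
labels = [1, 1, 1, 0, 0, 0]

for t in (0.50, 0.45):
    p, r = precision_recall_at(scores, labels, t)
    print(f"{t:.2f} precision={p:.2f} recall={r:.2f}")
# 0.50 precision=1.00 recall=0.67
# 0.45 precision=0.75 recall=1.00
```

In this toy data, lowering the threshold catches one more phishing URL at the cost of one false alarm, which is the trade-off that tuning the threshold on a validation set is meant to balance.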

Section 05

System Architecture & Deployment

Microservice architecture:

  1. FastAPI inference service: asynchronous and Dockerized for high performance
  2. Browser extension: intercepts URLs in real time, calls the API, and warns about or blocks suspicious pages
  3. Monitoring: MLflow (model versioning and tracking), Azure Blob (model storage), MongoDB (prediction logs and user feedback)
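
The request/response exchange between the browser extension and the inference service might look roughly like the sketch below. It is deliberately framework-free, so it shows only the logic that would sit inside a FastAPI route, not FastAPI itself. The JSON field names, the `handle_check` function, and the `score_url` callable (a stand-in for feature extraction plus the XGBoost model) are all hypothetical. The 0.45 threshold comes from the training section.

```python
import json
from typing import Callable

THRESHOLD = 0.45  # decision threshold reported in the post


def handle_check(body: str, score_url: Callable[[str], float]) -> dict:
    """Turn a JSON request such as {"url": "..."} into a verdict dictionary.

    `score_url` is a placeholder for the real model call; it must return
    the predicted phishing probability for the given URL.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return {"status": 400, "error": "body must be valid JSON"}

    url = payload.get("url") if isinstance(payload, dict) else None
    if not isinstance(url, str) or not url:
        return {"status": 400, "error": "missing 'url' field"}

    probability = float(score_url(url))
    return {
        "status": 200,
        "url": url,
        "phishing_probability": probability,
        # The extension would warn or block on "block" and do nothing on "allow".
        "verdict": "block" if probability >= THRESHOLD else "allow",
    }


# Example with a stub scorer in place of the real model.
response = handle_check('{"url": "http://login.example.top/verify"}', lambda url: 0.91)
```

Passing the scorer in as a parameter keeps the handler testable without loading a model, and the same response dictionary could also be written to the prediction log.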

Section 06

Continuous Learning: Closed-Loop Feedback Mechanism

PhishGuard's adaptive design:

  1. Data drift monitoring: Azure Functions compute the Population Stability Index (PSI) over a 15-day rolling window on user feedback data stored in MongoDB
  2. Auto retraining: GitHub Actions trigger retraining when drift is detected or enough new samples have accumulated; a new model replaces the production model only if it performs better
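
The drift check and the retraining gate described above can be sketched as follows. PSI is computed here with its standard definition, the sum over bins of (actual − expected) × ln(actual / expected), using equal-width bins derived from the reference sample. The bin count, the 0.2 drift threshold (a common rule of thumb), the minimum sample count, and all function names are assumptions, not values from PhishGuard.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference `expected`."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin on")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Clamp empty bins to eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(psi_value, new_samples, psi_threshold=0.2, min_new_samples=5000):
    """Retrain on detected drift OR enough new samples (hypothetical thresholds)."""
    return psi_value > psi_threshold or new_samples >= min_new_samples


def should_promote(candidate_metric, production_metric):
    """Promote the retrained model only if it beats the production model."""
    return candidate_metric > production_metric


# Toy feature values: a reference window and a shifted recent window.
reference = [i / 100 for i in range(100)]
recent = [0.5 + i / 200 for i in range(100)]
drift = psi(reference, recent)
```

In a setup like the one described, `psi` would run per feature on the rolling window, and `should_promote` would compare a held-out metric such as recall before the production model is swapped.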

Section 07

Application Value & Conclusion

Application value: detection is real-time because no webpage content has to be loaded, and privacy-friendly because page content is never accessed. The project also serves as a reference for developers moving ML models from experiment to production. Conclusion: PhishGuard combines cybersecurity and machine learning into an effective tool against evolving phishing attacks, and its ability to retrain itself helps it stay relevant.