PhishGuard: An XGBoost-Based Real-Time Phishing URL Detection System

An end-to-end machine learning system that achieves 98.28% accuracy in phishing website detection by analyzing URL structural features. It integrates a browser extension for real-time protection and includes a complete model monitoring and automatic retraining mechanism.

Tags: Phishing detection, XGBoost, Cybersecurity, Machine learning, Browser extension, URL analysis, Real-time protection, MLOps

Published 2026-05-16 23:15 | Recent activity 2026-05-16 23:19 | Estimated read 5 min

Section 01

PhishGuard: Core Overview of XGBoost-Based Real-Time Phishing URL Detection System

PhishGuard is an end-to-end machine learning system that uses XGBoost to detect phishing URLs in real time, with a reported accuracy of 98.28%. It analyzes only the structural features of a URL and never needs the webpage content, which keeps detection both efficient and privacy-friendly. It also integrates a browser extension for real-time protection and includes model monitoring and automatic retraining. Keywords: phishing detection, XGBoost, cybersecurity, browser extension, URL analysis, MLOps.

Section 02

Background: Phishing Threats & Limitations of Traditional Defenses

Phishing is one of the most prevalent and damaging cybersecurity threats: over 90% of network intrusions start with phishing emails or sites. Traditional blacklist-based defenses are reactive. They cannot identify newly created phishing sites, and attackers can bypass them simply by switching domains. ML-based proactive detection has therefore become an industry focus.

Section 03

Method: Feature Engineering from URL Structure

PhishGuard extracts 30 numerical features from URLs, grouped into 7 categories, including:

URL components: Protocol (FTP is almost exclusive to phishing samples), domain parts
Length: Phishing URLs have longer paths/queries (domain length is not a key distinguishing feature)
Domain: Phishing URLs favor low-cost TLDs (.top/.icu/.dev/.app) and second-level domains (SLDs) with more hyphens and digits
Entropy: Phishing URLs show higher path entropy (reflecting obfuscation techniques)
Character-level: Counts of dots, hyphens, digits, special characters etc.
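
The feature categories above can be illustrated with a minimal Python sketch. It is not PhishGuard's actual extractor: the function names, the feature names, and the naive second-level-domain split are assumptions made for illustration, and only a handful of the 30 features are shown.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# TLDs named in the article as common in phishing samples.
WATCHLIST_TLDS = {"top", "icu", "dev", "app"}


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Hypothetical extractor covering a few of the feature categories."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    tld = labels[-1] if len(labels) >= 2 else ""
    # Naive assumption: the SLD is the label just before the TLD.
    # A production system would consult the Public Suffix List instead.
    sld = labels[-2] if len(labels) >= 2 else host
    return {
        "is_https": int(parts.scheme == "https"),
        "is_ftp": int(parts.scheme == "ftp"),
        "path_length": len(parts.path),
        "query_length": len(parts.query),
        "tld_in_watchlist": int(tld in WATCHLIST_TLDS),
        "sld_hyphen_count": sld.count("-"),
        "sld_digit_count": sum(ch.isdigit() for ch in sld),
        "domain_dot_count": host.count("."),
        "domain_entropy": shannon_entropy(host),
        "path_entropy": shannon_entropy(parts.path),
    }
```

Because every value is derived from the URL string alone, no page has to be fetched before a prediction is made.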

Section 04

Method: Model Training & Optimization

Data preprocessing: Clean invalid URLs, expand short links, and remove more than 33k duplicate records, yielding a balanced training set of 200k URLs
Model selection: XGBoost outperforms random forest
Hyperparameter tuning: Bayesian optimization raises recall to 97.25%
Threshold adjustment: A decision threshold of 0.45 balances recall (97.34%) against precision (99.04%)
Feature selection: SHAP analysis identifies the 8 most impactful features (presence of HTTPS, domain dot count, domain entropy, etc.) and leads to the removal of 4 low-impact features
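
The effect of the threshold adjustment can be sketched in a few lines of Python. This is only an illustration of the technique: the helper name `precision_recall_at` and the toy labels and scores in the usage note are assumptions, not PhishGuard's data or code. Only the 0.45 value comes from the article.

```python
def precision_recall_at(y_true, y_prob, threshold=0.45):
    """Precision and recall when scores >= threshold are flagged as phishing.

    y_true holds 0/1 labels (1 = phishing); y_prob holds model probabilities.
    """
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label == 1:
            tp += 1  # phishing URL correctly flagged
        elif predicted and label == 0:
            fp += 1  # legitimate URL wrongly flagged
        elif not predicted and label == 1:
            fn += 1  # phishing URL missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

On toy data such as `y_true = [1, 1, 1, 0, 0]` and `y_prob = [0.9, 0.47, 0.2, 0.46, 0.1]`, lowering the threshold from 0.5 to 0.45 catches one more phishing URL but also flags one legitimate URL, which is the recall/precision trade-off the threshold is tuned to balance.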

Section 05

System Architecture & Deployment

Microservice architecture:

FastAPI inference: Asynchronous, Dockerized for high performance
Browser extension: Real-time URL interception, API calls, warning/blocking
Monitoring: MLflow (model versioning/tracking), Azure Blob (model storage), MongoDB (prediction logs/user feedback)
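
As a rough sketch of how the inference service and the extension might cooperate, the function below turns a model score into an action and a loggable record. Everything here is an assumption: the article does not say how warning and blocking are distinguished, so `BLOCK_THRESHOLD`, the field names, and `classify_url` itself are hypothetical. In the described architecture, logic like this would sit behind a FastAPI route, and the returned record could be stored as a prediction log.

```python
from datetime import datetime, timezone

WARN_THRESHOLD = 0.45   # decision threshold reported in the article
BLOCK_THRESHOLD = 0.90  # hypothetical cut-off for hard blocking


def classify_url(url, score_fn, model_version="unknown"):
    """Score a URL and build a JSON-serializable response record.

    score_fn is any callable mapping a URL to a phishing probability,
    for example a wrapper around feature extraction plus a trained model.
    """
    probability = float(score_fn(url))
    if probability >= BLOCK_THRESHOLD:
        action = "block"
    elif probability >= WARN_THRESHOLD:
        action = "warn"
    else:
        action = "allow"
    return {
        "url": url,
        "phishing_probability": probability,
        "action": action,
        "model_version": model_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping the decision in a plain function makes it testable without starting the web service, and the same record shape can be returned to the extension and written to the log store.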

Section 06

Continuous Learning: Closed-Loop Feedback Mechanism

PhishGuard's adaptive design:

Data drift monitoring: Azure Functions compute the Population Stability Index (PSI) over a 15-day rolling window on user feedback data from MongoDB
Auto retraining: GitHub Actions trigger retraining when drift is detected or enough new samples have accumulated; the new model replaces the production model only if its performance improves
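
The drift check and the retraining gate can be sketched as follows. PSI compares a baseline distribution with a recent one, bin by bin, as the sum of (actual - expected) * ln(actual / expected). The bin count, the 0.2 drift threshold (a common rule of thumb), the sample-count trigger, and all function names are assumptions; the article does not state the values PhishGuard uses.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against baseline `expected`.

    Both inputs are non-empty sequences of numbers for one feature.
    Bins are equal-width over the baseline's range; empty bins are
    floored at eps to avoid taking the logarithm of zero.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp outliers
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def should_retrain(psi_value, new_samples,
                   psi_threshold=0.2, min_new_samples=5000):
    """Trigger on drift or on enough accumulated samples (both hypothetical)."""
    return psi_value > psi_threshold or new_samples >= min_new_samples


def should_promote(candidate_metric, production_metric):
    """Replace the production model only if the candidate scores higher."""
    return candidate_metric > production_metric
```

In a setup like the one described, `psi` would run per feature on the rolling window, `should_retrain` would decide whether to start the training workflow, and `should_promote` would guard the final model swap.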

Section 07

Application Value & Conclusion

Application value: Detection runs in real time because no webpage content has to be loaded, and it is privacy-friendly because page content is never accessed. The project also serves as a reference for developers moving ML models from experiment to production. Conclusion: PhishGuard combines cybersecurity and machine learning into a practical tool against evolving phishing attacks, and its closed-loop retraining lets the model stay current as attack patterns change.
