Zing Forum

PhishGuard: A Machine Learning-Based Phishing Website Detection System to Safeguard Cybersecurity

This article introduces PhishGuard, a Flask web application that uses machine learning to detect phishing URLs. The system combines WHOIS data, URL feature analysis, and user authentication to identify phishing websites in real time and keep a history of past checks.

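Tags: phishing detection, cybersecurity, machine learning, Flask, WHOIS, URL analysis, web security, threat detection, malicious websites, user authentication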
Published 2026-05-31 07:45 · Recent activity 2026-05-31 07:55 · Estimated read 10 min

Section 01

PhishGuard: Introduction to the Machine Learning-Based Phishing Website Detection System

Core Introduction to PhishGuard

PhishGuard is an open-source project developed and maintained by nguyentrion (GitHub link: https://github.com/nguyentrion/Phishguard, released on 2026-05-30). It is a Flask web application that uses machine learning to detect phishing URLs. To counter the growing threat of phishing attacks, the system combines WHOIS data, URL feature analysis, and user authentication to identify phishing websites in real time and keep a history of past checks.

Section 02

Current State of Phishing Attacks and Limitations of Traditional Defenses

Severe Situation of Phishing Attacks

Phishing is one of the most common and destructive threats in cybersecurity. Attackers forge trusted websites to lure users into revealing sensitive information, causing billions of dollars in losses each year. Common methods include domain spoofing (typos, character substitutions, TLD replacements, subdomain deception), page cloning (copying a real website's content and layout), and social engineering (urgent notifications, reward offers, impersonation of authorities).

Limitations of Traditional Defenses

Traditional blacklist mechanisms have clear shortcomings: newly registered domains are listed only after a delay, shortened links conceal the real destination, and dynamically generated attack pages are hard to detect. HTTPS is no longer a reliable trust signal either, because attackers also obtain SSL certificates for their sites.

Section 03

PhishGuard System Architecture and Core Components

Overall Architecture

PhishGuard adopts a three-tier architecture: User Interface Layer (Flask Templates) → Business Logic Layer (Flask Routes + ML Model) → Data Layer (SQLite + WHOIS API).

Core Components

  1. URL Feature Extraction: Extracts structural features (length, domain length, path depth, number of special characters), semantic features (sensitive words, brand names, suspicious TLDs), and technical features (IP addresses, non-standard ports, excessive encoding) from URLs.
  2. WHOIS Data Integration: Uses domain age (domains registered less than 30 days ago are treated as high risk), registration information (privacy protection, registrar reputation, country), and DNS records (free DNS services, abnormal MX records) as detection features.
  3. Machine Learning Model: Uses supervised learning, converts features into numerical vectors, and supports models such as Random Forest, XGBoost, Logistic Regression, and Neural Networks. Training data comes from legitimate URLs (top-ranked websites on Alexa) and phishing URLs (PhishTank, OpenPhish databases).
  4. Web Application Layer: Provides user registration/login (password stored as hash), single/batch URL detection interfaces, and detection history records and statistical functions.
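
The URL feature extraction in item 1 can be sketched with the Python standard library alone. This is a minimal illustration, not PhishGuard's actual code: the function name `extract_url_features`, the feature names, and the word and TLD lists are all assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative lists only; these are assumptions, not PhishGuard's real lists.
SENSITIVE_WORDS = ("login", "verify", "secure", "account", "update")
SUSPICIOUS_TLDS = ("tk", "xyz", "top")
SPECIAL_CHARS = "@-_=?&%"
DEFAULT_PORTS = {"http": 80, "https": 443}


def _is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Map a URL to a flat dict of numeric features (hypothetical helper)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path_segments = [s for s in parts.path.split("/") if s]
    lowered = url.lower()
    port = parts.port
    return {
        # Structural features
        "url_length": len(url),
        "domain_length": len(host),
        "path_depth": len(path_segments),
        "special_char_count": sum(url.count(c) for c in SPECIAL_CHARS),
        # Semantic features
        "sensitive_word_count": sum(w in lowered for w in SENSITIVE_WORDS),
        "suspicious_tld": int(host.rsplit(".", 1)[-1] in SUSPICIOUS_TLDS),
        # Technical features
        "host_is_ip": int(_is_ip(host)),
        "nonstandard_port": int(
            port is not None and port != DEFAULT_PORTS.get(parts.scheme)
        ),
        "percent_encoded_count": url.count("%"),
    }
```

Returning a flat dict of numbers keeps the step independent of any particular model: the values can be placed in a fixed order to form the numerical vector that item 3 describes.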

Section 04

PhishGuard Technical Implementation Details

Data Flow

User submits a URL → URL parsing and validation → Feature extraction → Asynchronous WHOIS query → Feature vector construction → ML model prediction → Result display and storage in the detection history.
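
The flow above can be sketched as a single function with the WHOIS lookup and the model passed in as callables. Everything here is an assumption for illustration: the names `validate_url` and `check_url`, the four-element feature vector, and the 0.5 decision threshold are not taken from the project. For brevity the sketch calls the WHOIS lookup synchronously, whereas the article describes it as asynchronous.

```python
from datetime import date
from urllib.parse import urlsplit


def validate_url(raw: str) -> str:
    """Accept only http(s) URLs that carry a hostname; return the trimmed URL."""
    url = raw.strip()
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a checkable URL: {raw!r}")
    return url


def check_url(raw, whois_lookup, predict, today=None):
    """Run one URL through the pipeline and return a record ready for storage.

    whois_lookup(domain) -> registration date or None  (hypothetical callable)
    predict(vector) -> phishing probability in [0, 1]  (hypothetical callable)
    """
    today = today or date.today()
    url = validate_url(raw)                      # parsing and validation
    host = urlsplit(url).hostname
    registered = whois_lookup(host)              # WHOIS query
    # Unknown age is encoded as -1 so the model can treat it as its own case.
    age_days = (today - registered).days if registered else -1
    # Toy feature vector: URL length, host length, path depth, domain age.
    vector = [len(url), len(host), url.count("/") - 2, age_days]
    score = predict(vector)                      # model prediction
    return {
        "url": url,
        "score": score,
        "is_phishing": score >= 0.5,             # assumed decision threshold
        "features": vector,
    }
```

Passing the lookup and the model as parameters keeps the pipeline testable without network access or a trained model.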

Performance Optimization

  • WHOIS Cache: Caches query results locally with an expiration time; asynchronous queries avoid blocking.
  • Model Inference Optimization: Preloads models into memory, supports batch requests, and uses lightweight models to reduce latency.
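
A cache with an expiration time, as described in the WHOIS Cache item, might look like the following minimal sketch. It keeps entries in memory for simplicity, while the article describes a local cache backed by a database table. The class name `WhoisCache`, the one-day default TTL, and the `lookup` callable are assumptions.

```python
import time


class WhoisCache:
    """In-memory cache whose entries expire after ttl_seconds (illustrative)."""

    def __init__(self, lookup, ttl_seconds=86400, clock=time.monotonic):
        self._lookup = lookup    # hypothetical callable: domain -> WHOIS record
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # domain -> (expires_at, record)

    def get(self, domain):
        key = domain.lower()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]        # fresh entry: skip the remote query
        record = self._lookup(key)
        self._entries[key] = (now + self._ttl, record)
        return record
```

Because WHOIS services are often rate limited, serving repeated checks of the same domain from the cache reduces both latency and the number of external queries.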

Database Design

Includes a detection history table (stores user ID, URL, prediction result, confidence, timestamp) and a WHOIS cache table (stores domain name, registration date, registrar, cache time).
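
The two tables can be sketched with Python's built-in `sqlite3` module. The column list follows the description above, but the table names, column names, types, and constraints are assumptions, not the project's actual schema.

```python
import sqlite3

# Assumed schema: names, types, and constraints are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS detection_history (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id     INTEGER NOT NULL,
    url         TEXT    NOT NULL,
    prediction  TEXT    NOT NULL CHECK (prediction IN ('phishing', 'legitimate')),
    confidence  REAL    NOT NULL CHECK (confidence BETWEEN 0 AND 1),
    checked_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS whois_cache (
    domain            TEXT PRIMARY KEY,
    registration_date TEXT,
    registrar         TEXT,
    cached_at         TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_detection(conn, user_id, url, prediction, confidence):
    """Append one result to the detection history."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO detection_history (user_id, url, prediction, confidence) "
            "VALUES (?, ?, ?, ?)",
            (user_id, url, prediction, confidence),
        )
```

Parameterized queries (the `?` placeholders) matter here because the stored URLs are untrusted input by definition.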

Section 05

PhishGuard Application Scenarios

  1. Personal User Protection: As a browser plugin or independent web application, it provides link pre-detection, real-time warnings, and historical record review functions.
  2. Enterprise Security Gateway: Integrated into email gateways (detect phishing links), web proxies (filter malicious URLs), and SIEM systems (security event correlation analysis).
  3. Security Research: Provides phishing URL datasets, feature analysis tools, and model effect evaluation support for researchers.

Section 06

Limitations and Improvement Directions of PhishGuard

Current Limitations

  • Adversarial Attacks: Attackers can bypass detection through feature evasion and model deception, and concept drift (phishing tactics changing after the model was trained) erodes accuracy over time.
  • False Positives and False Negatives: Legitimate websites are misjudged or new phishing methods are missed; balancing the two is challenging.
  • Dependency on External Services: WHOIS queries rely on third parties; service unavailability or rate limits affect detection capabilities.
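
The balance between false positives and false negatives comes down to where the decision threshold sits on the model's score. A minimal sketch, assuming the model outputs a phishing probability; the function name `error_rates` and the scores below are made up for illustration and are not results from PhishGuard.

```python
def error_rates(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) at a score threshold.

    scores: phishing probabilities in [0, 1]; labels: 1 = phishing, 0 = legitimate.
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    negatives = labels.count(0)
    positives = labels.count(1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Made-up toy data: three legitimate URLs followed by three phishing URLs.
toy_scores = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]

for t in (0.2, 0.5, 0.8):
    fpr, fnr = error_rates(toy_scores, toy_labels, t)
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

On this toy data, lowering the threshold to 0.2 catches every phishing URL but flags two of the three legitimate ones, while raising it to 0.8 flags no legitimate URL but misses two of the three phishing ones.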

Improvement Directions

  • Multi-Model Fusion: Voting mechanisms, stacking (stacked ensembles), and confidence weighting to improve accuracy.
  • Deep Learning: Character-level CNN, LSTM, Transformer to process raw URL strings.
  • Real-Time Learning: Online model updates, integration of user feedback, and active identification of new threats.
  • Multi-Dimensional Detection: Combine page content analysis, visual similarity detection, behavior analysis, and threat intelligence integration.
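
As one example of the multi-model fusion direction, a confidence-weighted soft vote averages the phishing probabilities from several models. This is a minimal sketch of the general technique; the function name `weighted_vote`, the weights, and the 0.5 threshold are assumptions, and nothing in the article says PhishGuard implements it this way.

```python
def weighted_vote(probabilities, weights=None, threshold=0.5):
    """Combine per-model phishing probabilities into one decision.

    probabilities: one probability per model.
    weights: optional per-model weights (for example, validation accuracy).
    Returns (combined_probability, is_phishing).
    """
    if not probabilities:
        raise ValueError("need at least one model output")
    if weights is None:
        weights = [1.0] * len(probabilities)   # plain average when unweighted
    if len(weights) != len(probabilities):
        raise ValueError("weights and probabilities must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(p * w for p, w in zip(probabilities, weights)) / total
    return combined, combined >= threshold
```

For instance, `weighted_vote([0.8, 0.2], weights=[3, 1])` leans toward the first model and classifies the URL as phishing, whereas an unweighted vote over `[0.6, 0.2]` does not.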

Section 07

Cybersecurity Ecosystem and Conclusion

Open Source Community and Industry Standards

PhishGuard integrates into the open-source ecosystem and collaborates with projects such as PhishTank (community phishing URL database) and OpenPhish (real-time intelligence service). It follows industry standards like DMARC, SPF/DKIM, HSTS, and Certificate Transparency.

Collaborative Defense

Effective phishing defense requires multi-party collaboration: security vendors share intelligence, registrars quickly take down malicious domains, and user education enhances security awareness.

Conclusion

PhishGuard demonstrates the practical application of machine learning in cybersecurity, but its value lies more in its open-source nature, allowing the community to jointly improve and respond to new threats. Technical tools need to be combined with user security awareness to build an effective defense line.