Zing Forum

Real-Time Phishing Detection System Based on Certificate Transparency Logs: Practice of Multi-Layer Machine Learning Architecture

This article introduces a real-time phishing website detection system that uses Certificate Transparency logs, Aho-Corasick brand pre-filtering, and a machine learning ensemble model to achieve real-time classification of malicious domains.

Tags: phishing detection, Certificate Transparency, CT logs, Aho-Corasick, XGBoost, LightGBM, machine learning, cybersecurity, domain classification
Published 2026-06-16 03:16 | Recent activity 2026-06-16 03:23 | Estimated read 7 min

Section 01

[Introduction] Real-Time Phishing Detection System Based on CT Logs: Practice of Multi-Layer Machine Learning Architecture

Project core: PhishingClassifier (by oliwiapietka, open source on GitHub) is a real-time phishing detection system built on Certificate Transparency (CT) logs. It combines Aho-Corasick brand pre-filtering with a stacking ensemble of machine learning models (XGBoost, LightGBM, and Random Forest) to flag malicious domains quickly. The aim is to reduce the lag inherent in traditional blacklists while balancing detection speed and accuracy.

Section 02

Project Background and Security Threats

Threats of Phishing Attacks

  • Phishing is a major cybersecurity threat; a commonly cited estimate is that about 90% of cyberattacks start with a phishing email. A phishing website often lives for only a few hours, so traditional blacklists cannot respond in time.

Value of CT Logs

  • Certificate Transparency (CT) is a system of public, append-only logs of issued SSL/TLS certificates; major browsers require publicly trusted certificates to be logged. An attacker who obtains a certificate for a phishing website therefore leaves a public trace, so monitoring the logs in real time can surface phishing domains early.
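
To make the monitoring idea concrete, here is a minimal sketch of how a poller might page through new log entries. It assumes the log exposes the RFC 6962 HTTP endpoints (`ct/v1/get-sth` for the current tree size, `ct/v1/get-entries` with inclusive `start` and `end` indices). The log URL, the batch size, and the helper names are hypothetical. No network request is made, and parsing the returned certificates is left out.

```python
from urllib.parse import urlencode


def get_sth_url(log_url: str) -> str:
    """URL of the signed tree head, whose JSON includes the current tree_size."""
    return log_url.rstrip("/") + "/ct/v1/get-sth"


def get_entries_url(log_url: str, start: int, end: int) -> str:
    """URL for entries start..end (both inclusive, zero-based)."""
    query = urlencode({"start": start, "end": end})
    return f"{log_url.rstrip('/')}/ct/v1/get-entries?{query}"


def batch_ranges(next_index: int, tree_size: int, batch_size: int = 256):
    """Yield inclusive (start, end) pairs covering entries not yet fetched.

    next_index is the first entry the poller has not seen; tree_size comes
    from the latest signed tree head.
    """
    start = next_index
    while start < tree_size:
        end = min(start + batch_size, tree_size) - 1
        yield start, end
        start = end + 1


# Hypothetical log URL, used for illustration only.
LOG_URL = "https://ct.example.test/log/"
urls = [get_entries_url(LOG_URL, s, e) for s, e in batch_ranges(10, 15, batch_size=2)]
```

A log may return fewer entries than requested, so a real poller would advance `next_index` by the number of entries actually received, not by the batch size.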

Project Origin

  • This project uses real-time monitoring of CT logs, combined with efficient algorithms and machine learning, to build a system for quickly identifying malicious domains.

Section 03

System Architecture and Core Technologies

System Architecture

  • Pipeline design: Data collection layer (real-time monitoring of CT logs to extract domains) → Preprocessing layer (Aho-Corasick brand pre-filtering) → Classification layer (machine learning ensemble model judgment).
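
The preprocessing layer of this pipeline can be sketched with a small pure-Python Aho-Corasick automaton. This is an illustrative implementation, not the project's code; the class name, the `prefilter` helper, and the keyword list are assumptions. A production system would more likely use an optimized library.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher: scans a text once, whatever the number of patterns."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> next state (trie edges)
        self.fail = [0]    # fail[state] -> longest proper suffix that is also a trie prefix
        self.out = [[]]    # out[state] -> patterns that end at this state
        for pattern in patterns:
            self._add(pattern)
        self._build_failure_links()

    def _add(self, pattern):
        state = 0
        for ch in pattern:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = nxt
            state = nxt
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first walk; depth-1 states keep their failure link at the root.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target (shorter suffixes).
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text):
        """Return the set of patterns that occur anywhere in text."""
        state, hits = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            hits.update(self.out[state])
        return hits


def prefilter(domains, automaton):
    """Yield (domain, matched keywords) only for domains with at least one hit."""
    for domain in domains:
        hits = automaton.search(domain.lower())
        if hits:
            yield domain, hits


# Hypothetical keyword library: brand names plus generic phishing words.
matcher = AhoCorasick(["paypal", "pay", "login", "apple"])
suspicious = dict(prefilter(
    ["secure-paypal-login.example.com", "weather.example.org"], matcher
))
```

Because only domains with a keyword hit reach the classifier, the more expensive model inference runs on a small fraction of the certificate stream.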

Core Technologies

  1. CT Log Monitoring: Pull logs via HTTP API, extract key information like domains, and handle high-throughput log streams.
  2. Aho-Corasick Pre-Filtering: Multi-pattern string matching; the keyword library includes brand names, phishing keywords, and their variants, narrowing the scope of suspicious domains and improving efficiency.
  3. Machine Learning Ensemble Model: The Stacking strategy integrates XGBoost (regularization, missing value handling), LightGBM (histogram optimization, leaf-wise growth), and Random Forest (Bagging, random feature selection), with a meta-classifier integrating the outputs of base models.
  4. Feature Engineering: Structural features (domain length, number of subdomains, entropy value, etc.) + Linguistic features (brand similarity, readability, N-gram distribution, etc.).
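
A few of the features in item 4 can be sketched with the standard library alone. The feature names, the brand list, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions made for illustration; the project may compute these differently. The subdomain count naively treats the last two labels as the registrable domain, where a real system would consult the Public Suffix List.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(domain: str, brands) -> dict:
    """Return a small dictionary of structural and linguistic features."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    # Naive assumption: the last two labels form the registrable domain.
    subdomain_labels = labels[:-2]
    # Highest similarity between any single label and any brand name.
    brand_similarity = max(
        (SequenceMatcher(None, label, brand).ratio()
         for label in labels for brand in brands),
        default=0.0,
    )
    return {
        "length": len(name),
        "num_subdomains": len(subdomain_labels),
        "num_hyphens": name.count("-"),
        "num_digits": sum(ch.isdigit() for ch in name),
        "entropy": shannon_entropy(name),
        "brand_similarity": brand_similarity,
    }


features = extract_features("login.paypa1.example.com", ["paypal", "google"])
```

High entropy tends to indicate machine-generated names, while a high brand similarity on a label that is not the brand itself (as with `paypa1`) points to imitation.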

Section 04

System Performance and Application Scenarios

System Performance

  • Real-time processing latency: CT log pulling takes seconds, Aho-Corasick matching takes milliseconds, model inference takes 10-50 milliseconds, and the overall detection is completed within a few seconds.
  • Accuracy and false positive rate: The decision threshold can be tuned to the use case (a strict mode with few false positives suits browsers; a loose mode with high recall suits research).
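
The trade-off between the strict and loose modes can be illustrated by choosing a threshold under a false positive budget. The scores, labels, and function names below are made up for illustration and are not results from the project.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (recall, false_positive_rate) when score >= threshold means phishing."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr


def pick_threshold(scores, labels, max_fpr):
    """Lowest observed score usable as a threshold with FPR <= max_fpr.

    The false positive rate can only fall as the threshold rises, so the first
    candidate that fits the budget also gives the highest recall within it.
    Returns None if no observed score satisfies the budget.
    """
    for candidate in sorted(set(scores)):
        _, fpr = rates_at_threshold(scores, labels, candidate)
        if fpr <= max_fpr:
            return candidate
    return None


# Toy validation set (label 1 = phishing), invented for this example.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

strict = pick_threshold(scores, labels, max_fpr=0.0)   # browser-style mode
loose = pick_threshold(scores, labels, max_fpr=0.5)    # research-style mode
```

On this toy data the strict setting gives up half of the recall to reach zero false positives, while the loose setting catches every phishing sample at the cost of flagging half of the benign ones.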

Application Scenarios

  • Browser Vendors: Integrate the classifier into safe-browsing features to display warnings for high-risk sites.
  • Enterprise Security: Monitor brand-related suspicious domains to detect attacks early.
  • Security Research: Analyze phishing trends and evaluate the effectiveness of detection algorithms.

Section 05

Technical Challenges and Optimization Directions

Technical Challenges and Countermeasures

  1. Adversarial Attacks: Character obfuscation (homoglyphs) is countered with visual-similarity features, DGA domains with DGA-specific feature analysis, and slow attacks by incorporating WHOIS behavior patterns.
  2. Model Interpretability: SHAP values to quantify feature contributions, decision path visualization, and manual review feedback.
  3. Cold Start Problem: Active learning to label new brand domains, zero-shot learning using text embeddings, and community collaboration for intelligence sharing.
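
The homoglyph countermeasure in item 1 can be approximated by mapping look-alike characters to a canonical form before matching brands. The confusables table below is a tiny hand-picked sample and the function names are hypothetical; a real system would draw on a much larger table such as the Unicode confusables data. The sketch also assumes internationalized labels have already been decoded from their `xn--` form.

```python
import unicodedata

# Small illustrative sample of look-alike characters (an assumption, not a full table).
CONFUSABLES = {
    "0": "o",
    "1": "l",
    "3": "e",
    "5": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0455": "s",  # Cyrillic small dze
}


def skeleton(label: str) -> str:
    """Fold case and compatibility forms, then replace known look-alikes."""
    folded = unicodedata.normalize("NFKC", label).casefold()
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)


def imitated_brands(domain: str, brands) -> set:
    """Brands that appear only after look-alike characters are normalized."""
    hits = set()
    for label in domain.split("."):
        raw = label.casefold()
        normalized = skeleton(label)
        for brand in brands:
            if brand in normalized and brand not in raw:
                hits.add(brand)
    return hits


brands = {"paypal", "google"}
flagged = imitated_brands("secure-paypa1.example.com", brands)
```

A brand that matches only after normalization is a stronger signal than a plain substring match, since a legitimate domain has no reason to spell its own name with substituted characters.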

Section 06

Comparison with Related Technologies and Summary Insights

Comparison with Related Technologies

  • Traditional Blacklists: They lag behind new threats, their coverage depends on manual reporting, and they are costly to maintain. CT monitoring, by contrast, discovers new domains within seconds, covers any domain that obtains a logged certificate, and can be maintained automatically.
  • DNS Monitoring: Complementary to CT monitoring (CT discovers domains earlier, while DNS captures domains that never appear in CT logs).

Summary Insights

  • Core Design: Layered architecture balances speed and accuracy; multi-model integration improves stability; feature diversity enhances classification ability.
  • Insights: CT logs can be used as a source of threat intelligence; classic algorithms (Aho-Corasick) still have value; machine learning needs to be designed in combination with domain characteristics.