Real-Time Phishing Detection System Based on Certificate Transparency Logs: Practice of Multi-Layer Machine Learning Architecture

This article introduces a phishing website detection system that monitors Certificate Transparency logs, pre-filters domains by brand with Aho-Corasick, and classifies the remaining candidates in real time with a machine learning ensemble.

Tags: phishing detection, Certificate Transparency, CT logs, Aho-Corasick, XGBoost, LightGBM, machine learning, cybersecurity, domain classification

Published 2026-06-16 03:16 | Recent activity 2026-06-16 03:23 | Estimated read 7 min

Section 01

Introduction

Project Core: PhishingClassifier (by oliwiapietka, open source on GitHub) is a real-time phishing detection system built on Certificate Transparency (CT) logs. It combines Aho-Corasick brand pre-filtering with a stacking ensemble (XGBoost, LightGBM, Random Forest) to identify malicious domains quickly. The aim is to avoid the lag of traditional blacklists while balancing detection speed and accuracy.

Section 02

Project Background and Security Threats

Threats of Phishing Attacks

Phishing is a major cybersecurity threat; by the figure cited here, 90% of cyberattacks start with a phishing email. A phishing website often stays live for only a few hours, so traditional blacklists, which list a site only after it has been reported, cannot respond in time.

Value of CT Logs

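Certificate Transparency (CT) is a framework in which CAs submit the SSL/TLS certificates they issue to public, append-only logs; major browsers require this for publicly trusted certificates. An attacker who obtains a certificate for a phishing website therefore leaves a public trace, and monitoring the logs in real time can surface phishing domains early.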
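
To make log monitoring concrete, here is a minimal sketch that assumes the monitored log exposes the RFC 6962 HTTP API (the get-sth and get-entries endpoints). The article does not say which API the project actually uses, so the endpoint choice, the function names, the batch size, and the example URL are all assumptions. The sketch only plans which entry ranges to request; fetching and certificate parsing are left out.

```python
from urllib.parse import urlencode


def get_sth_url(log_url: str) -> str:
    """URL of the signed tree head, whose tree_size is the current entry count."""
    return log_url.rstrip("/") + "/ct/v1/get-sth"


def get_entries_url(log_url: str, start: int, end: int) -> str:
    """URL for entries start..end (0-based, inclusive, as in RFC 6962)."""
    query = urlencode({"start": start, "end": end})
    return f"{log_url.rstrip('/')}/ct/v1/get-entries?{query}"


def plan_batches(next_index: int, tree_size: int, batch_size: int = 256):
    """Split the unseen entries [next_index, tree_size) into inclusive ranges."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ranges = []
    start = next_index
    while start < tree_size:
        end = min(start + batch_size, tree_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Hypothetical log URL, for illustration only.
LOG_URL = "https://ct.example.test/log/"
for start, end in plan_batches(next_index=0, tree_size=5, batch_size=2):
    print(get_entries_url(LOG_URL, start, end))
```

A monitor built this way would poll get-sth, compare tree_size with the last index it processed, and request only the new ranges. A log may return fewer entries than requested, so a real client has to advance by the number of entries it actually received.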

Project Origin

The project builds on this idea: it monitors CT logs in real time and combines efficient string matching with machine learning to identify malicious domains quickly.

Section 03

System Architecture and Core Technologies

System Architecture

Pipeline design: Data collection layer (real-time monitoring of CT logs to extract domains) → Preprocessing layer (Aho-Corasick brand pre-filtering) → Classification layer (machine learning ensemble model judgment).
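
The brand pre-filter in the preprocessing layer can be sketched in plain Python. This is an illustrative Aho-Corasick automaton, not the project's code: the class, the `prefilter` helper, and the three-entry keyword list are hypothetical, and a production system would more likely use an optimized library.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher: one pass over the text finds every keyword."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link for each node
        self.out = [[]]    # patterns that end at each node
        for pattern in set(patterns):
            self._add(pattern)
        self._build()

    def _add(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = nxt
            node = nxt
        self.out[node].append(pattern)

    def _build(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the longest proper suffix.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits


# Hypothetical keyword library; a real one would also hold brand variants.
BRAND_KEYWORDS = ["paypal", "login", "secure"]
_AUTOMATON = AhoCorasick(BRAND_KEYWORDS)


def prefilter(domains):
    """Yield (domain, matched keywords) only for domains with at least one hit."""
    for domain in domains:
        matched = sorted({p for _, p in _AUTOMATON.search(domain.lower())})
        if matched:
            yield domain, matched
```

Because the automaton is built once and each domain is scanned in a single pass, the cost per domain does not grow with the number of keywords, which is why this stage works as a cheap gate in front of the slower classifier.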

Core Technologies

CT Log Monitoring: Pulls log entries over an HTTP API, extracts key fields such as domain names, and has to keep up with a high-throughput log stream.
Aho-Corasick Pre-Filtering: Multi-pattern string matching against a keyword library of brand names, phishing keywords, and their variants. Only domains that match go on to classification, which narrows the set of suspicious domains and improves efficiency.
Machine Learning Ensemble Model: A stacking ensemble combines XGBoost (built-in regularization and missing-value handling), LightGBM (histogram-based splits, leaf-wise tree growth), and Random Forest (bagging with random feature selection); a meta-classifier takes the base models' outputs and produces the final prediction.
Feature Engineering: Structural features (domain length, number of subdomains, entropy, etc.) plus linguistic features (brand similarity, readability, N-gram distribution, etc.).
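
A few of the features named above can be computed with the standard library alone. The sketch below is illustrative: the function names and the exact feature set are assumptions, the subdomain count naively treats the last two labels as the registered domain, and `difflib` stands in for whatever brand-similarity measure the project uses.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def structural_features(domain: str) -> dict:
    """Structural features of a domain name (naive public-suffix handling)."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    return {
        "length": len(name),
        # Assumes a one-label public suffix such as "com"; "co.uk" would miscount.
        "num_subdomains": max(len(labels) - 2, 0),
        "num_hyphens": name.count("-"),
        "digit_ratio": sum(ch.isdigit() for ch in name) / len(name) if name else 0.0,
        "entropy": shannon_entropy(name.replace(".", "")),
    }


def brand_similarity(domain: str, brand: str) -> float:
    """Highest similarity ratio (0..1) between any label and the brand name."""
    labels = domain.lower().rstrip(".").split(".")
    return max(SequenceMatcher(None, label, brand).ratio() for label in labels)


print(structural_features("login.paypal.example.com"))
print(brand_similarity("paypa1.example.com", "paypal"))
```

Random-looking names tend to have higher character entropy than dictionary words, while the similarity ratio catches near-miss spellings of a brand that an exact keyword match would skip.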

Section 04

System Performance and Application Scenarios

System Performance

Real-time processing latency: Pulling CT logs takes seconds, Aho-Corasick matching takes milliseconds, and model inference takes 10 to 50 milliseconds, so end-to-end detection completes within a few seconds.
Accuracy and false positive rate: The decision threshold can be tuned per deployment. A strict mode with few false positives suits browsers; a loose mode with high recall suits research.
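
The trade-off between the two modes can be shown with a small sketch. The threshold values, the function names, and the toy scores and labels are all hypothetical; the article does not give the project's actual thresholds or evaluation data.

```python
# Hypothetical operating points; real values would be tuned on validation data.
THRESHOLDS = {"strict": 0.9, "loose": 0.5}


def is_phishing(score: float, mode: str = "strict") -> bool:
    """Flag a domain when the model's phishing probability reaches the threshold."""
    return score >= THRESHOLDS[mode]


def precision_recall(scores, labels, threshold):
    """Precision and recall of 'score >= threshold' against 0/1 labels."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy data for illustration only: 1 = phishing, 0 = benign.
scores = [0.95, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
for mode, threshold in THRESHOLDS.items():
    print(mode, precision_recall(scores, labels, threshold))
```

On this toy data the strict threshold flags nothing benign but misses two of the three phishing domains, while the loose threshold catches more phishing at the cost of one false positive.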

Application Scenarios

Browser Vendors: Integrate safe browsing functions to display high-risk warnings.
Enterprise Security: Monitor brand-related suspicious domains to detect attacks early.
Security Research: Analyze phishing trends and evaluate the effectiveness of detection algorithms.

Section 05

Technical Challenges and Optimization Directions

Technical Challenges and Countermeasures

Adversarial Attacks: Character obfuscation with homoglyphs is countered with visual-similarity features; domains produced by domain generation algorithms (DGAs) are countered with DGA-specific feature analysis; slow attacks are countered by also looking at WHOIS behavior patterns.
Model Interpretability: SHAP values quantify each feature's contribution, decision paths can be visualized, and manual review feeds corrections back into the system.
Cold Start Problem: Active learning to label domains for new brands, zero-shot learning with text embeddings, and community collaboration to share intelligence.
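
As one example of a visual-similarity feature for the homoglyph case, the sketch below reduces a label to a "skeleton" by mapping look-alike characters to a canonical form and then compares skeletons. The mapping is a small hand-picked sample rather than the full Unicode confusables data, and the function names are hypothetical. Certificates carry internationalized names in punycode ("xn--" labels), so a real pipeline would have to decode those first.

```python
import unicodedata

# Small illustrative sample of look-alike characters, not a complete table.
CONFUSABLES = {
    "0": "o",
    "1": "l",
    "3": "e",
    "5": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0131": "i",  # Latin small dotless i
}


def skeleton(label: str) -> str:
    """Normalize a label and replace known look-alikes with a canonical character."""
    text = unicodedata.normalize("NFKC", label).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


def looks_like(label: str, brand: str) -> bool:
    """True when label differs from brand but has the same skeleton."""
    return label != brand and skeleton(label) == skeleton(brand)


print(looks_like("p\u0430ypal", "paypal"))  # Cyrillic "a" in place of Latin "a"
print(looks_like("paypa1", "paypal"))       # digit one in place of "l"
```

Applying the same mapping to both the candidate label and the brand keeps the comparison consistent, even for brands whose real names contain digits.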

Section 06

Comparison with Related Technologies and Summary Insights

Comparison with Related Technologies

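Traditional Blacklists: They lag behind new sites, their coverage depends on manual reporting, and they are costly to maintain. CT monitoring discovers new domains within seconds of certificate issuance, covers every domain that appears in CT logs, and can be maintained automatically.
DNS Monitoring: Complementary to CT monitoring. CT surfaces a domain earlier, while DNS monitoring catches domains that never appear in CT logs.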

Summary Insights

Core Design: The layered architecture balances speed and accuracy, combining several models improves stability, and diverse features strengthen classification.
Insights: CT logs are a usable source of threat intelligence; classic algorithms such as Aho-Corasick still have value; and machine learning works best when it is designed around the characteristics of the problem domain.