# Real-Time Phishing Detection System Based on Certificate Transparency Logs: Practice of Multi-Layer Machine Learning Architecture

> This article introduces a real-time phishing website detection system that uses Certificate Transparency logs, Aho-Corasick brand pre-filtering, and a machine learning ensemble model to achieve real-time classification of malicious domains.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-15T19:16:39.000Z
- Last activity: 2026-06-15T19:23:51.443Z
- Popularity: 143.9
- Keywords: phishing detection, Certificate Transparency, CT logs, Aho-Corasick, XGBoost, LightGBM, machine learning, cybersecurity, domain classification
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-oliwiapietka-phishingclassifier
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-oliwiapietka-phishingclassifier
- Markdown source: floors_fallback

---

## [Introduction] Real-Time Phishing Detection System Based on CT Logs: Practice of Multi-Layer Machine Learning Architecture

**Project Core**: The PhishingClassifier project (by oliwiapietka, open source on GitHub) builds a real-time phishing detection system on top of Certificate Transparency (CT) logs. It combines Aho-Corasick brand pre-filtering with a stacking ensemble of machine learning models (XGBoost, LightGBM, Random Forest) to identify malicious domains quickly, addressing the lag of traditional blacklists while balancing detection speed against accuracy.

## Project Background and Security Threats

### Threats of Phishing Attacks
- Phishing is a major cybersecurity threat; by a frequently cited industry estimate, around 90% of cyberattacks start with a phishing email. A phishing website is often live for only a few hours, so traditional blacklists, which depend on a site being reported and reviewed first, cannot respond in time.
### Value of CT Logs
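- Under Certificate Transparency (CT), publicly trusted CAs record the SSL/TLS certificates they issue in public, append-only logs, and major browsers require this logging before they trust a certificate. An attacker who obtains a certificate for a phishing website therefore leaves a public trace, so monitoring the logs in real time can surface phishing domains early.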
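
As a rough illustration of what such a trace looks like: RFC 6962 logs return each entry from the `get-entries` endpoint with a base64 `leaf_input` field that holds a `MerkleTreeLeaf` structure. The sketch below is not taken from the project; it decodes only the fixed header and the embedded certificate bytes using the standard library. The function name `parse_merkle_tree_leaf` and the keys of the returned dictionary are illustrative assumptions, and reading domain names out of the DER bytes would additionally need an X.509 parser, which is omitted here.

```python
import base64
import struct

X509_ENTRY, PRECERT_ENTRY = 0, 1  # LogEntryType values from RFC 6962


def parse_merkle_tree_leaf(leaf_input_b64: str) -> dict:
    """Decode the header and certificate bytes of an RFC 6962 MerkleTreeLeaf."""
    raw = base64.b64decode(leaf_input_b64)
    # version (1 byte), leaf_type (1 byte), timestamp in ms (8 bytes),
    # entry_type (2 bytes); all big-endian, 12 bytes in total.
    version, leaf_type, timestamp_ms, entry_type = struct.unpack_from(">BBQH", raw, 0)
    offset = 12
    if entry_type == PRECERT_ENTRY:
        offset += 32  # skip the 32-byte issuer_key_hash
    elif entry_type != X509_ENTRY:
        raise ValueError(f"unknown entry type {entry_type}")
    # The certificate (or TBSCertificate for precerts) has a 3-byte length prefix.
    length = int.from_bytes(raw[offset:offset + 3], "big")
    der = raw[offset + 3:offset + 3 + length]
    if len(der) != length:
        raise ValueError("truncated leaf_input")
    return {
        "version": version,
        "leaf_type": leaf_type,
        "timestamp_ms": timestamp_ms,
        "entry_type": "precert_entry" if entry_type == PRECERT_ENTRY else "x509_entry",
        "der": der,  # DER bytes to hand to an X.509 parser
    }
```

For a precertificate entry the returned bytes are a TBSCertificate rather than a complete certificate, so a downstream parser has to treat the two entry types differently.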
### Project Origin
- This project uses real-time monitoring of CT logs, combined with efficient algorithms and machine learning, to build a system for quickly identifying malicious domains.

## System Architecture and Core Technologies

### System Architecture
- Pipeline design: Data collection layer (real-time monitoring of CT logs to extract domains) → Preprocessing layer (Aho-Corasick brand pre-filtering) → Classification layer (machine learning ensemble model judgment).
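
The three layers can be pictured as chained generators, so each domain flows through the stages as soon as it arrives. This is a minimal sketch under stated assumptions, not the project's code: the function names are hypothetical, a plain substring scan stands in for the Aho-Corasick stage, and `toy_score` is a placeholder for the ensemble model.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def collect(cert_domains: Iterable[List[str]]) -> Iterator[str]:
    """Collection layer: flatten per-certificate name lists and normalise them."""
    for names in cert_domains:
        for name in names:
            name = name.lower()
            if name.startswith("*."):  # drop the wildcard label
                name = name[2:]
            yield name


def prefilter(domains: Iterable[str], keywords: List[str]) -> Iterator[str]:
    """Preprocessing layer: keep only domains that contain a watched keyword.

    A naive substring scan stands in for the Aho-Corasick automaton here.
    """
    for domain in domains:
        if any(keyword in domain for keyword in keywords):
            yield domain


def classify(
    domains: Iterable[str], score: Callable[[str], float], threshold: float
) -> Iterator[Tuple[str, float]]:
    """Classification layer: emit (domain, score) for domains at or above threshold."""
    for domain in domains:
        s = score(domain)
        if s >= threshold:
            yield domain, s


def toy_score(domain: str) -> float:
    """Placeholder for the ensemble model: more hyphens and labels score higher."""
    return min(1.0, 0.2 * domain.count("-") + 0.2 * domain.count("."))


batch = [
    ["*.example.org"],
    ["paypal-secure-login.example.net", "www.paypal-secure-login.example.net"],
]
alerts = list(classify(prefilter(collect(batch), ["paypal", "login"]), toy_score, 0.7))
```

Because every stage is lazy, the cheap pre-filter discards most domains before the scoring function is ever called on them.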
### Core Technologies
1. **CT Log Monitoring**: Pull logs via HTTP API, extract key information like domains, and handle high-throughput log streams.
2. **Aho-Corasick Pre-Filtering**: Multi-pattern string matching; the keyword library includes brand names, phishing keywords, and their variants, narrowing the scope of suspicious domains and improving efficiency.
3. **Machine Learning Ensemble Model**: The Stacking strategy integrates XGBoost (regularization, missing value handling), LightGBM (histogram optimization, leaf-wise growth), and Random Forest (Bagging, random feature selection), with a meta-classifier integrating the outputs of base models.
4. **Feature Engineering**: Structural features (domain length, number of subdomains, entropy value, etc.) + Linguistic features (brand similarity, readability, N-gram distribution, etc.).
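
To make the pre-filtering step in item 2 concrete, here is a compact, self-contained Aho-Corasick automaton. It is a textbook-style sketch rather than the project's implementation (a library such as `pyahocorasick` would likely be used in practice), and the keyword list in the usage note is purely illustrative.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


class AhoCorasick:
    """Multi-pattern matcher: finds all keyword occurrences in one pass over the text."""

    def __init__(self, patterns: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # node -> {char: child node}
        self.fail: List[int] = [0]              # node -> failure link
        self.out: List[List[str]] = [[]]        # node -> patterns ending here
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern: str) -> None:
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(pattern)

    def _build_failure_links(self) -> None:
        # Breadth-first, so a node's failure target is finished before the node.
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target (proper suffixes).
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find(self, text: str) -> List[Tuple[int, str]]:
        """Return (start_index, pattern) for every match, in order of match end."""
        node = 0
        hits: List[Tuple[int, str]] = []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits
```

With the automaton built once from an illustrative keyword list such as `["paypal", "pay", "login", "secure"]`, calling `find` on each incoming domain costs time proportional to the domain length plus the number of matches, regardless of how many keywords are loaded. A domain with no hits can be dropped before any model inference.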

## System Performance and Application Scenarios

### System Performance
- Real-time processing latency: CT log pulling takes seconds, Aho-Corasick matching takes milliseconds, model inference takes 10-50 milliseconds, and the overall detection is completed within a few seconds.
- Accuracy and false positive rate: The decision threshold trades false positives against recall. A strict (higher) threshold with few false positives suits browser warnings, while a loose (lower) threshold with higher recall suits research.
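
The threshold trade-off can be made concrete with a small helper that picks the lowest threshold still meeting a false positive budget. This is a sketch under stated assumptions: the function names are hypothetical, and the scores and labels are made-up toy values, not results from the project.

```python
from typing import Dict, Optional, Sequence


def rates(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Dict[str, float]:
    """Precision, recall and false positive rate when flagging score >= threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        # Convention: precision is 1.0 when nothing is flagged.
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


def lowest_threshold_within_fpr(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float
) -> Optional[float]:
    """Lowest observed score usable as a threshold without exceeding max_fpr.

    The false positive rate never rises as the threshold rises, so the first
    qualifying candidate also gives the highest recall within the budget.
    """
    for t in sorted(set(scores)):
        if rates(scores, labels, t)["fpr"] <= max_fpr:
            return t
    return None


# Toy validation data (illustrative only): label 1 = phishing, 0 = benign.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

strict = lowest_threshold_within_fpr(scores, labels, max_fpr=0.0)  # browser-style
loose = lowest_threshold_within_fpr(scores, labels, max_fpr=0.2)   # research-style
```

On this toy data the strict setting gives up one phishing domain to avoid every false alarm, while the loose setting catches all five at the cost of one benign domain being flagged.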
### Application Scenarios
- **Browser Vendors**: Integrate safe browsing functions to display high-risk warnings.
- **Enterprise Security**: Monitor brand-related suspicious domains to detect attacks early.
- **Security Research**: Analyze phishing trends and evaluate the effectiveness of detection algorithms.

## Technical Challenges and Optimization Directions

### Technical Challenges and Countermeasures
1. **Adversarial Attacks**: Character obfuscation (homoglyphs) → visual similarity features; DGA domains → DGA feature analysis; slow attacks → combining WHOIS behavior patterns.
2. **Model Interpretability**: SHAP values to quantify feature contributions, decision path visualization, and manual review feedback.
3. **Cold Start Problem**: Active learning to label new brand domains, zero-shot learning using text embeddings, and community collaboration for intelligence sharing.
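
Two of the countermeasures in item 1 lend themselves to a short sketch: folding look-alike characters before comparing a label with a brand name, and using character entropy as one DGA signal. The code below is an illustration, not the project's feature set. The `CONFUSABLES` table is a tiny hand-picked assumption (a real system would draw on the Unicode confusables data), the function names are hypothetical, and labels are assumed to be already decoded from punycode to Unicode.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Tiny illustrative table of look-alike characters (an assumption, far from complete).
CONFUSABLES = {
    "0": "o", "1": "l", "3": "e", "5": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
}


def fold(label: str) -> str:
    """Map each look-alike character to the ASCII letter it imitates."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())


def brand_similarity(label: str, brand: str) -> float:
    """Similarity in [0, 1] between the folded label and a brand name."""
    return SequenceMatcher(None, fold(label), brand).ratio()


def shannon_entropy(label: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty label."""
    n = len(label)
    return sum(c / n * math.log2(n / c) for c in Counter(label).values())


spoofed = "p\u0430yp\u04301"  # Cyrillic 'a' twice and the digit '1' for 'l'
raw_score = SequenceMatcher(None, spoofed, "paypal").ratio()
folded_score = brand_similarity(spoofed, "paypal")
```

Folding turns the spoofed label back into an exact brand match, whereas the raw comparison scores it as only partly similar. Entropy alone is a weak signal and would normally be one feature among many, since short random-looking labels and legitimate abbreviations can overlap.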

## Comparison with Related Technologies and Summary Insights

### Comparison with Related Technologies
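- **Traditional Blacklists**: Entries lag behind new sites, coverage depends on manual reporting, and maintenance is costly. CT monitoring, by contrast, can discover a domain on the order of seconds after its certificate is logged, covers any domain that obtains a publicly logged certificate, and can be maintained automatically.
- **DNS Monitoring**: Complementary to CT monitoring. CT tends to reveal a domain earlier, while DNS monitoring catches domains that never appear in CT logs.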
### Summary Insights
- Core Design: Layered architecture balances speed and accuracy; multi-model integration improves stability; feature diversity enhances classification ability.
- Insights: CT logs can be used as a source of threat intelligence; classic algorithms (Aho-Corasick) still have value; machine learning needs to be designed in combination with domain characteristics.
