# ViTPhishFusion: A Multimodal Phishing Website Detection System Fusing Visual and URL Features

> ViTPhishFusion is a multimodal phishing website detection system that combines Vision Transformer (ViT) visual features with URL lexical features. On a custom dataset of 6,000 websites it reports 80% accuracy and 85% recall, targeting visually deceptive phishing attacks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-13T10:41:35.000Z
- Last activity: 2026-06-13T10:51:17.279Z
- Popularity: 150.8
- Keywords: phishing website detection, Vision Transformer, multimodal learning, cybersecurity, ViT, URL analysis, visual features, machine learning
- Page link: https://www.zingnex.cn/en/forum/thread/vitphishfusion-url
- Canonical: https://www.zingnex.cn/forum/thread/vitphishfusion-url
- Markdown source: floors_fallback

---

## Introduction: Core Overview of the ViTPhishFusion Multimodal Phishing Detection System

ViTPhishFusion is a multimodal phishing website detection system. It fuses Vision Transformer (ViT) visual features with URL lexical features to counter the visual deception used in modern phishing attacks. On a custom dataset of 6,000 website samples, the system reports 80% accuracy and 85% recall.

## Background: Detection Dilemma of Visually Deceptive Phishing Attacks

Modern phishing attackers use highly realistic visual designs (precise color matching, convincing logos, professional typography), so phishing pages are almost indistinguishable from legitimate websites in appearance. Traditional detection methods based on blacklists and rule matching cannot interpret what a page looks like, so they miss these attacks. ViTPhishFusion is proposed to address this gap.

## Core Architecture: Dual Extraction of Visual and URL Features

### Visual Feature Extraction

A Vision Transformer (ViT) processes web page screenshots. Each screenshot is divided into image patches, and self-attention across the patches captures global visual information such as layout, color, and logo position. The output is an embedding vector that encodes these visual features.
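
The patch-splitting step can be illustrated with a minimal, dependency-free sketch. It is not the system's actual preprocessing: the 224 x 224 input size, the 16 x 16 patch size, the grayscale nested-list image format, and the name `split_into_patches` are all assumptions chosen for illustration (the first two are common ViT defaults, not values stated in the post).

```python
from typing import List

# Hypothetical image format: a grayscale image as rows of pixel values.
Image = List[List[float]]


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping square patches.

    Patches are returned in row-major order (left to right, top to bottom),
    which is the order in which a ViT would receive them as a sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Assumed sizes: 224 x 224 screenshot, 16 x 16 patches.
screenshot = [[0.0] * 224 for _ in range(224)]
patches = split_into_patches(screenshot, patch_size=16)
print(len(patches), len(patches[0]))  # 196 patches, each with 256 values
```

In a real ViT each flattened patch is then linearly projected and combined with a position embedding before the self-attention layers; that part is omitted here.
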
### URL Lexical Feature Engineering

Hand-engineered features are extracted from each URL: length, number of dots, hyphen/digit ratio, presence of the @ symbol, HTTPS status, whether the host is an IP address, suspicious keywords (e.g., login, verify), and others. The features are standardized before use.
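
A rough sketch of how such features might be computed with the Python standard library follows. The feature names, the two-word keyword list, and the reading of "hyphen/digit ratio" as two separate ratios (hyphens per character and digits per character) are assumptions; the post does not give the exact definitions.

```python
import ipaddress
import statistics
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical keyword list: the post only names "login" and "verify" as examples.
SUSPICIOUS_KEYWORDS = ("login", "verify")


def host_is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> Dict[str, float]:
    """Compute simple lexical features for one URL (assumed definitions)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    length = len(url)
    lowered = url.lower()
    return {
        "length": float(length),
        "num_dots": float(url.count(".")),
        "hyphen_ratio": url.count("-") / length if length else 0.0,
        "digit_ratio": sum(c.isdigit() for c in url) / length if length else 0.0,
        "has_at": float("@" in url),
        "is_https": float(parts.scheme == "https"),
        "host_is_ip": float(host_is_ip(host)),
        "num_keywords": float(sum(k in lowered for k in SUSPICIOUS_KEYWORDS)),
    }


def standardize(rows: List[Dict[str, float]]) -> List[List[float]]:
    """Z-score every feature column; constant columns are mapped to 0."""
    names = list(rows[0])
    columns = []
    for name in names:
        values = [row[name] for row in rows]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        columns.append([(v - mean) / std if std else 0.0 for v in values])
    # Transpose back from per-feature columns to per-URL rows.
    return [list(row) for row in zip(*columns)]


urls = ["https://www.example.com/", "http://192.168.0.1/login-verify/account"]
features = [url_features(u) for u in urls]
standardized = standardize(features)
```

In practice the mean and standard deviation would be estimated on the training split only and reused at inference time; the sketch standardizes a single batch for brevity.
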

## Feature Fusion and Classification Mechanism: Comprehensive Utilization of Multimodal Information

The system concatenates the visual embedding vector extracted by ViT with the URL lexical feature vector to form a single fused representation. The fused features are fed into a fully connected classification network with ReLU activation and Dropout regularization, and a final Sigmoid outputs the phishing probability. Because the architecture combines visual style recognition with URL anomaly detection, an attacker who evades one kind of feature can still be caught by the other.
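
The forward pass described above can be sketched in plain Python. This is an illustrative stand-in, not the project's model: the class name `FusionClassifier`, the single hidden layer, the hidden width of 32, the dropout rate of 0.3, and the 768-dimensional visual embedding are all assumptions, and the weights are random, so the output is not a meaningful prediction. A real implementation would use a deep learning framework and trained weights.

```python
import math
import random
from typing import List

Vector = List[float]


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def dropout(x: Vector, rate: float, training: bool, rng: random.Random) -> Vector:
    """Inverted dropout: active only in training, identity at inference."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class FusionClassifier:
    """Concatenate -> Linear -> ReLU -> Dropout -> Linear -> Sigmoid (assumed layout)."""

    def __init__(self, visual_dim: int, url_dim: int, hidden_dim: int = 32,
                 dropout_rate: float = 0.3, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.input_dim = visual_dim + url_dim
        self.dropout_rate = dropout_rate
        self.w1 = self._init(hidden_dim, self.input_dim)
        self.b1 = [0.0] * hidden_dim
        self.w2 = self._init(1, hidden_dim)
        self.b2 = [0.0]

    def _init(self, rows: int, cols: int) -> List[Vector]:
        # Random placeholder weights; a trained model would load learned values.
        scale = 1.0 / math.sqrt(cols)
        return [[self.rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]

    def predict_proba(self, visual_embedding: Vector, url_features: Vector,
                      training: bool = False) -> float:
        fused = list(visual_embedding) + list(url_features)  # feature concatenation
        if len(fused) != self.input_dim:
            raise ValueError("unexpected fused feature length")
        hidden = relu(linear(fused, self.w1, self.b1))
        hidden = dropout(hidden, self.dropout_rate, training, self.rng)
        logit = linear(hidden, self.w2, self.b2)[0]
        return sigmoid(logit)


# Assumed dimensions: 768-d visual embedding, 8 URL features.
model = FusionClassifier(visual_dim=768, url_dim=8)
probability = model.predict_proba([0.1] * 768, [0.0] * 8)
```

Concatenation is the simplest fusion strategy: the classifier sees both modalities in one vector and learns how to weight them, at the cost of not modelling interactions between them explicitly.
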

## Dataset Construction and Experimental Results: Performance Metric Analysis

### Dataset Construction
The custom dataset contains 6000 samples (3000 phishing / 3000 legitimate), covering various phishing types and legitimate website domains such as banking, e-commerce, and social media.
### Experimental Results
| Metric | Value |
|--------|-------|
| Accuracy | 80% |
| Recall | 85% |
| F1 Score | 0.80 |

The high recall (85%) is the most important of these figures: the system catches most phishing pages, leaving roughly 15% of them undetected.
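
As a sanity check, the reported figures can be related to each other through the confusion matrix. The sketch below assumes the evaluation set is balanced (50% phishing), like the full dataset; the post does not describe the train/test split, so this is an assumption, and the function name `implied_metrics` is illustrative.

```python
from typing import Dict


def implied_metrics(accuracy: float, recall: float,
                    positive_fraction: float = 0.5) -> Dict[str, float]:
    """Derive precision, F1 and false positive rate from accuracy and recall.

    All confusion-matrix cells are expressed as fractions of the evaluation set.
    """
    tp = recall * positive_fraction           # phishing pages caught
    fn = positive_fraction - tp               # phishing pages missed
    tn = accuracy - tp                        # legitimate pages passed
    fp = (1.0 - positive_fraction) - tn       # legitimate pages flagged
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "f1": f1,
        "false_positive_rate": fp / (1.0 - positive_fraction),
        "missed_fraction": fn / positive_fraction,
    }


metrics = implied_metrics(accuracy=0.80, recall=0.85)
print(f"precision={metrics['precision']:.3f}")            # precision=0.773
print(f"f1={metrics['f1']:.3f}")                          # f1=0.810
print(f"fpr={metrics['false_positive_rate']:.3f}")        # fpr=0.250
```

Under the balanced-set assumption, the implied F1 of about 0.81 agrees with the reported 0.80 to within rounding, and the implied precision of about 77% means roughly one in four legitimate sites would be flagged.
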

## Practical Significance and Application Prospects: Value of Multimodal Detection

ViTPhishFusion points to several practical uses for multimodal phishing detection:
- End users: it can be integrated into browser extensions to warn of suspicious websites in real time;
- Enterprises: it can serve as a supplementary layer for web security gateways, catching attacks that traditional solutions miss;
- Researchers: it provides an extensible multimodal framework for exploring further feature combinations.

The system demonstrates the value of visual understanding in cybersecurity and supports the development of multimodal security tools.

## Future Development Directions: Model Optimization and Productization

Future development directions include:
1. **Model Lightweighting**: Train lightweight models through knowledge distillation to support browser extension/mobile device deployment;
2. **Productization**: Develop browser extensions and REST API services;
3. **Interpretability**: Build an AI explanation dashboard to explain suspicious visual elements and URL features;
4. **Dataset Expansion**: Collect larger-scale datasets with multiple languages and attack types;
5. **ViT Fine-tuning**: End-to-end fine-tuning of the ViT backbone network for phishing detection tasks.
