Zing Forum


ViTPhishFusion: A Multimodal Phishing Website Detection System Fusing Visual and URL Features

ViTPhishFusion is an innovative multimodal phishing website detection system. By combining Vision Transformer (ViT) visual features and URL lexical features, it achieves 80% accuracy and 85% recall on a custom dataset of 6000 websites, effectively identifying visually deceptive phishing attacks.

Tags: phishing website detection, Vision Transformer, multimodal learning, cybersecurity, ViT, URL analysis, visual features, machine learning
Published 2026-06-13 18:41 | Recent activity 2026-06-13 18:51 | Estimated read 6 min

Section 01

Introduction: Core Overview of the ViTPhishFusion Multimodal Phishing Detection System

ViTPhishFusion is a multimodal phishing website detection system. It fuses Vision Transformer (ViT) visual features with URL lexical features to counter the visual deception used in modern phishing attacks. On a custom dataset of 6000 website samples, the system achieves 80% accuracy and 85% recall, which suggests it can identify many visually realistic phishing pages.


Section 02

Background: Detection Dilemma of Visually Deceptive Phishing Attacks

Modern phishing attackers use highly realistic visual designs (precise color matching, convincing logos, professional typography), making phishing pages almost indistinguishable from legitimate websites in appearance. Traditional detection methods based on blacklists and rule matching miss these pages because they have no way to interpret what a page looks like. ViTPhishFusion is proposed to address this gap.


Section 03

Core Architecture: Dual Extraction of Visual and URL Features

Visual Feature Extraction

Vision Transformer (ViT) is used to process web page screenshots: the screenshot is divided into image patches, and global visual information such as layout, color, and logo position is captured through the self-attention mechanism, outputting an embedding vector encoding visual features.
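
The patch step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the 224x224 screenshot size, the 16x16 patch size, and the single-channel pixel grid are hypothetical choices (common ViT defaults), not values given in the article, and `split_into_patches` is a made-up helper name.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping square patches.

    Patches are returned in row-major order, the order in which they
    would be laid out as a token sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical sizes: a 224x224 single-channel "screenshot" and 16x16 patches.
IMAGE_SIZE, PATCH_SIZE = 224, 16
screenshot = [
    [(r * IMAGE_SIZE + c) % 256 for c in range(IMAGE_SIZE)]
    for r in range(IMAGE_SIZE)
]
patches = split_into_patches(screenshot, PATCH_SIZE)
num_patches = len(patches)       # (224 / 16) ** 2 patches
patch_length = len(patches[0])   # 16 * 16 pixel values per patch
```

In an actual ViT, each flattened patch would then be linearly projected and combined with a position embedding before the self-attention layers; those steps are not shown here.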

URL Lexical Feature Engineering

Hand-crafted features are extracted from the URL, including length, number of dots, hyphen/digit ratio, presence of the @ symbol, HTTPS status, whether the host is an IP address, and suspicious keywords (e.g., login, verify), among others. The features are standardized before use.
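
A minimal sketch of such a feature extractor, using only the Python standard library, might look as follows. The function names, the feature names, and the two-word keyword list are assumptions made for illustration; the article does not give the exact feature definitions, and "hyphen/digit ratio" is interpreted here as two separate ratios over the URL length.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list; the article names only "login" and "verify".
SUSPICIOUS_KEYWORDS = ("login", "verify")


def host_is_ip(host):
    """Return True if the host part of the URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url):
    """Compute a small dictionary of lexical features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    length = len(url)
    lowered = url.lower()
    return {
        "length": length,
        "num_dots": url.count("."),
        # Assumption: ratios are taken over the full URL length.
        "hyphen_ratio": url.count("-") / length if length else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in url) / length if length else 0.0,
        "has_at": int("@" in url),
        "is_https": int(parsed.scheme == "https"),
        "host_is_ip": int(host_is_ip(host)),
        "num_suspicious_keywords": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


features = extract_url_features("http://192.168.0.1/secure-login/verify?id=42")
```

The raw values live on very different scales (length in tens of characters, ratios below 1), which is why a standardization step such as z-scoring per feature is applied before the vector is used.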


Section 04

Feature Fusion and Classification Mechanism: Comprehensive Utilization of Multimodal Information

The system concatenates the visual embedding vector extracted by ViT with the URL lexical feature vector to form a single fused feature representation. The fused features are fed into a fully connected classification network (with ReLU activation and Dropout regularization), and a final Sigmoid outputs the phishing probability. Because the classifier sees both visual style and URL anomalies, an attacker who evades one kind of feature can still be flagged by the other.
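
The forward pass of such a fusion classifier can be sketched in plain Python. Everything concrete here is an assumption for illustration: the layer sizes, the dropout rate, the single hidden layer, and the random untrained weights are hypothetical, and the two input vectors are stand-ins for a real ViT embedding and standardized URL features.

```python
import math
import random


def linear(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def dropout(values, rate, training, rng):
    """Inverted dropout: active only during training, identity at inference."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def phishing_probability(visual_embedding, url_features, params,
                         training=False, rng=None):
    """Concatenate both modalities, then hidden layer -> ReLU -> dropout -> sigmoid."""
    fused = list(visual_embedding) + list(url_features)  # feature-level fusion
    hidden = relu(linear(fused, params["w1"], params["b1"]))
    hidden = dropout(hidden, params["dropout_rate"], training, rng or random.Random())
    logit = linear(hidden, params["w2"], params["b2"])[0]
    return sigmoid(logit)


# Toy, untrained parameters with hypothetical sizes:
# 4 visual inputs + 3 URL inputs, 5 hidden units.
rng = random.Random(0)
VISUAL_DIM, URL_DIM, HIDDEN_DIM = 4, 3, 5
params = {
    "w1": [[rng.uniform(-0.5, 0.5) for _ in range(VISUAL_DIM + URL_DIM)]
           for _ in range(HIDDEN_DIM)],
    "b1": [0.0] * HIDDEN_DIM,
    "w2": [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]],
    "b2": [0.0],
    "dropout_rate": 0.3,
}
visual_embedding = [0.2, -0.1, 0.7, 0.05]  # stand-in for a ViT embedding
url_features = [1.3, -0.4, 0.9]            # stand-in for standardized URL features
probability = phishing_probability(visual_embedding, url_features, params)
```

With untrained weights the output is meaningless as a prediction; the sketch only shows how the two vectors are joined and passed through the layers. A real implementation would more likely use a deep learning framework and learn the weights with a binary cross-entropy loss.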


Section 05

Dataset Construction and Experimental Results: Performance Metric Analysis

Dataset Construction

The custom dataset contains 6000 samples (3000 phishing / 3000 legitimate), covering various phishing types and legitimate website domains such as banking, e-commerce, and social media.

Experimental Results

Metric      Value
Accuracy    80%
Recall      85%
F1 Score    0.80

The 85% recall is the most important figure here: the system flags roughly 85 of every 100 phishing sites it evaluates, which limits the number of missed detections.
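
To see how these metrics relate, the short sketch below computes them from a confusion matrix. The counts are hypothetical: they assume a balanced evaluation set of 1000 phishing and 1000 legitimate pages, chosen so that accuracy and recall match the reported 80% and 85%. The article does not state the size or class balance of its evaluation split, so the implied precision and F1 are only indicative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)        # share of phishing pages that were caught
    precision = tp / (tp + fp)     # share of phishing alerts that were correct
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts for a balanced set (1000 phishing, 1000 legitimate).
metrics = classification_metrics(tp=850, fp=250, tn=750, fn=150)
rounded = {name: round(value, 2) for name, value in metrics.items()}
# rounded == {"accuracy": 0.8, "recall": 0.85, "precision": 0.77, "f1": 0.81}
```

Under this balanced assumption the F1 score comes out near 0.81, close to the reported 0.80, and about a quarter of legitimate pages would be flagged as phishing. A different class balance in the real evaluation split would shift both numbers.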

Section 06

Practical Significance and Application Prospects: Value of Multimodal Detection

ViTPhishFusion represents an important direction in phishing detection technology:

  • End users: Can be integrated into browser extensions to warn of suspicious websites in real time;
  • Enterprises: As a supplementary layer for Web security gateways, capturing attacks missed by traditional solutions;
  • Researchers: Provides an extensible multimodal framework to explore more feature combinations.

This system demonstrates the value of visual understanding in cybersecurity and promotes the development of multimodal security tools.

Section 07

Future Development Directions: Model Optimization and Productization

Future development directions include:

  1. Model Lightweighting: Train lightweight models through knowledge distillation to support browser extension/mobile device deployment;
  2. Productization: Develop browser extensions and REST API services;
  3. Interpretability: Build an AI explanation dashboard to explain suspicious visual elements and URL features;
  4. Dataset Expansion: Collect larger-scale datasets with multiple languages and attack types;
  5. ViT Fine-tuning: End-to-end fine-tuning of the ViT backbone network for phishing detection tasks.