ViTPhishFusion: A Multimodal Phishing Website Detection System Fusing Visual and URL Features

ViTPhishFusion is a multimodal phishing website detection system. It combines Vision Transformer (ViT) visual features with URL lexical features and reports 80% accuracy and 85% recall on a custom dataset of 6000 websites, targeting visually deceptive phishing attacks.

Phishing website detection, Vision Transformer, multimodal learning, cybersecurity, ViT, URL analysis, visual features, machine learning

Published 2026-06-13 18:41. Recent activity 2026-06-13 18:51. Estimated read 6 min.

Section 01

Introduction: Core Overview of the ViTPhishFusion Multimodal Phishing Detection System

ViTPhishFusion is a multimodal phishing website detection system that fuses Vision Transformer (ViT) visual features with URL lexical features to counter the visual deception used in modern phishing attacks. On a custom dataset of 6000 website samples, it reports 80% accuracy and 85% recall.

Section 02

Background: Detection Dilemma of Visually Deceptive Phishing Attacks

Modern phishing attackers have adopted highly realistic visual designs (such as precise color matching, realistic logos, and professional typography), making phishing pages almost indistinguishable from legitimate websites in appearance. Traditional detection methods based on blacklists and rule matching miss such pages because they cannot interpret what a page looks like. ViTPhishFusion is proposed to address this gap.

Section 03

Core Architecture: Dual Extraction of Visual and URL Features

Visual Feature Extraction

Vision Transformer (ViT) is used to process web page screenshots: the screenshot is divided into image patches, and global visual information such as layout, color, and logo position is captured through the self-attention mechanism, outputting an embedding vector encoding visual features.
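
The patch-splitting step described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function name `split_into_patches` is hypothetical, and the article does not state the input resolution or patch size (224x224 screenshots with 16x16 patches is a common ViT configuration and is only assumed here). A real ViT would then project each flattened patch to an embedding and apply self-attention across the resulting sequence.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, the order in which a ViT
    would turn them into a token sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 grayscale "screenshot" (assumed size, for illustration only).
toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
toy_patches = split_into_patches(toy_image, patch_size=2)
```

Under the assumed 224x224 input and 16x16 patches, the same function would yield 14 x 14 = 196 patches per screenshot.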

URL Lexical Feature Engineering

Hand-designed URL features are extracted, including length, number of dots, hyphen/digit ratio, presence of the @ symbol, HTTPS status, IP address detection, and suspicious keywords (e.g., login, verify). The features are standardized before being passed to the model.
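
A rough sketch of how such features might be computed with the Python standard library follows. The function names, the feature names, and the choice to count keyword hits are assumptions made for illustration; the keyword list contains only the two examples the article mentions, and the z-score in `standardize` is one common reading of "standardization", not a confirmed detail.

```python
import ipaddress
from urllib.parse import urlparse

SUSPICIOUS_KEYWORDS = ("login", "verify")  # only the two the article names


def host_is_ip(host):
    """Return True when the host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url):
    """Map a URL to hand-designed lexical features (hypothetical names)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    length = len(url)
    lowered = url.lower()
    return {
        "length": length,
        "num_dots": url.count("."),
        "hyphen_ratio": url.count("-") / length if length else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in url) / length if length else 0.0,
        "has_at": int("@" in url),
        "is_https": int(parsed.scheme == "https"),
        "host_is_ip": int(host_is_ip(host)),
        "num_keywords": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


def standardize(column):
    """Z-score one feature column; a constant column maps to all zeros."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = variance ** 0.5
    return [(x - mean) / std if std else 0.0 for x in column]


example = extract_url_features("http://192.168.0.1/secure-login/verify.php")
```

In a training pipeline, the mean and standard deviation would normally be computed on the training split only and reused unchanged at inference time.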

Section 04

Feature Fusion and Classification Mechanism: Comprehensive Utilization of Multimodal Information

The system concatenates the visual embedding vector extracted by ViT with the URL lexical feature vector to form a single fused feature representation. The fused features are fed into a fully connected classification network (with ReLU activation and Dropout regularization), which outputs the phishing probability through a sigmoid. Combining visual style recognition with URL anomaly detection reduces the risk that an attacker evades detection by manipulating only one kind of feature.
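
The fusion and classification path can be illustrated with a framework-free forward pass. Everything concrete here is an assumption: the article does not give layer counts or sizes, so the sketch uses a single hidden layer, toy dimensions, and hand-picked weights stored in a hypothetical `params` dictionary. Dropout is omitted because it is only active during training.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def linear(inputs, weights, bias):
    """Dense layer: one weight row per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def phishing_probability(visual_embedding, url_features, params):
    """Inference-time pass: concat -> dense -> ReLU -> dense -> sigmoid.

    Dropout is the identity at inference time, so it does not appear here.
    """
    fused = list(visual_embedding) + list(url_features)  # feature fusion
    hidden = relu(linear(fused, params["w1"], params["b1"]))
    (logit,) = linear(hidden, params["w2"], params["b2"])
    return sigmoid(logit)


# Hypothetical toy sizes and weights, for illustration only:
# a 3-dim "visual embedding", 2 URL features, and 2 hidden units.
toy_params = {
    "w1": [[0.5, -0.2, 0.1, 0.8, 0.3], [-0.4, 0.6, 0.2, -0.1, 0.7]],
    "b1": [0.0, 0.1],
    "w2": [[1.0, -1.0]],
    "b2": [0.0],
}
prob = phishing_probability([0.2, -0.1, 0.4], [1.5, -0.5], toy_params)
```

A real implementation would more likely express the same structure in a deep learning framework so the weights can be trained; the arithmetic per layer is the same.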

Section 05

Dataset Construction and Experimental Results: Performance Metric Analysis

Dataset Construction

The custom dataset contains 6000 samples (3000 phishing / 3000 legitimate), covering various phishing types and legitimate website domains such as banking, e-commerce, and social media.

Experimental Results

Metric	Value
Accuracy	80%
Recall	85%
F1 Score	0.80
The 85% recall is the most important figure here: the system detects roughly 85 of every 100 phishing pages it evaluates, so about 15% of attacks still go undetected.
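
To see how these metrics relate to one another, the following sketch computes them from confusion-matrix counts. The counts are hypothetical and not taken from the article, which does not publish its evaluation split or confusion matrix. They assume a balanced evaluation set of 100 phishing and 100 legitimate pages and are chosen to reproduce 80% accuracy and 85% recall.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts (assumed balanced set, NOT from the article):
# 85 of 100 phishing pages caught, 75 of 100 legitimate pages passed.
hypothetical = metrics_from_confusion(tp=85, fp=25, fn=15, tn=75)
```

Under that assumption, the F1 score works out to 170/210, about 0.81, which is close to the reported 0.80. It also implies that about a quarter of legitimate pages would be flagged, which matters for the browser-extension use case.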

Section 06

Practical Significance and Application Prospects: Value of Multimodal Detection

ViTPhishFusion represents an important direction in phishing detection technology:

End users: Can be integrated into browser extensions to warn of suspicious websites in real time;
Enterprises: As a supplementary layer for Web security gateways, capturing attacks missed by traditional solutions;
Researchers: Provides an extensible multimodal framework to explore more feature combinations.

This system demonstrates the value of visual understanding in cybersecurity and promotes the development of multimodal security tools.

Section 07

Future Development Directions: Model Optimization and Productization

Future development directions include:

Model Lightweighting: Train lightweight models through knowledge distillation to support browser extension/mobile device deployment;
Productization: Develop browser extensions and REST API services;
Interpretability: Build an AI explanation dashboard to explain suspicious visual elements and URL features;
Dataset Expansion: Collect larger-scale datasets with multiple languages and attack types;
ViT Fine-tuning: End-to-end fine-tuning of the ViT backbone network for phishing detection tasks.
