# Beyond Semantics: Cross-Modal Synthetic Image Detection via Universal Physical Descriptors

> This paper systematically explores 15 physical features, identifies 5 core features that stably distinguish real from AI-generated images across more than 20 datasets, and combines them with CLIP's semantic understanding. It achieves SOTA results on the GenImage benchmark, with accuracy of up to 99.8% on some subsets.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-06T11:50:29.000Z
- Last activity: 2026-04-07T07:54:38.589Z
- Popularity: 126.9
- Keywords: deepfake detection, physical features, cross-modal learning, CLIP, AIGC, image authenticity
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-04608v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-04608v1

---

## [Introduction] Beyond Semantics: New Breakthrough in Cross-Modal Synthetic Image Detection via Physical Features + CLIP

This paper addresses the detection challenge posed by AI-generated content (AIGC) with an approach grounded in physical image properties rather than semantics alone. It systematically evaluates 15 physical features, selects 5 core features that remain stable across datasets, and combines them with CLIP's semantic understanding. The method reports state-of-the-art (SOTA) results on the GenImage benchmark, with accuracy of up to 99.8% on some subsets, and targets the weak generalization of existing detectors.

## Background: Adaptability Crisis in Deepfake Detection

Existing deepfake detectors mostly rely on semantic features (e.g., texture, edge statistics) and tend to overfit to specific generative models. A detector trained on GAN output, for example, performs poorly on diffusion-model output, and in real-world use it cannot handle fake images from unknown generative architectures. The result is a cat-and-mouse game in which detectors must be retrained for each new generator, which motivates the search for architecture-agnostic, universal features.

## Method: Exploration of Physical Features and Identification of Core Features

Starting from physical principles, the research team explores 15 candidate physical features spanning five dimensions: frequency domain, edge gradient, noise, statistics, and color. The candidates are tested on more than 20 datasets generated by GANs and diffusion models. Feature selection algorithms then identify 5 core features whose discriminative power is consistent across datasets, including Laplacian variance, Sobel statistics, and residual noise variance.
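As a rough illustration of what such descriptors measure, the sketch below computes three of the named features on a grayscale image given as a list of rows. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the 3x3 box blur is a stand-in denoiser for the noise residual, and mean and variance of the gradient magnitude are one possible reading of "Sobel statistics". Pure Python keeps it dependency-free; a real pipeline would more likely use NumPy or OpenCV.

```python
from statistics import mean, pvariance

# Standard 3x3 kernels.
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
BOX_BLUR = [[1 / 9] * 3 for _ in range(3)]  # assumption: stand-in denoiser


def filter3x3(img, kernel):
    """Correlate a 2D image (list of rows) with a 3x3 kernel, interior pixels only."""
    height, width = len(img), len(img[0])
    out = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    acc += kernel[dy][dx] * img[y + dy - 1][x + dx - 1]
            row.append(acc)
        out.append(row)
    return out


def flatten(grid):
    """Turn a list of rows into one flat list of values."""
    return [value for row in grid for value in row]


def physical_descriptors(img):
    """Return a dict of simple physical descriptors; img must be at least 3x3."""
    lap = flatten(filter3x3(img, LAPLACIAN))
    gx = flatten(filter3x3(img, SOBEL_X))
    gy = flatten(filter3x3(img, SOBEL_Y))
    grad_mag = [(a * a + b * b) ** 0.5 for a, b in zip(gx, gy)]

    # Noise residual: interior pixels minus their blurred version.
    blurred = flatten(filter3x3(img, BOX_BLUR))
    interior = flatten([row[1:-1] for row in img[1:-1]])
    residual = [p - b for p, b in zip(interior, blurred)]

    return {
        "laplacian_variance": pvariance(lap),
        "sobel_mean": mean(grad_mag),
        "sobel_variance": pvariance(grad_mag),
        "residual_noise_variance": pvariance(residual),
    }
```

Calling `physical_descriptors` on a perfectly flat image gives zero for every descriptor, while an image with a sharp edge gives a nonzero Sobel mean. Whether these particular formulas match the five features selected in the paper is not stated in this summary.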

## Method: Cross-Modal Fusion Strategy of Physical Features and CLIP

The physical features are converted to text (e.g., 'Laplacian variance: 0.85') and fed into the CLIP framework together with semantic descriptions. Multimodal alignment then maps the image's visual features, the physical text, and the semantic descriptions into a unified embedding space. This combines the generalization of physical features with the contextual strengths of semantic understanding.
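The textualization and matching steps can be sketched as follows. This is a hedged illustration only: the prompt template, the function names, and the temperature value of 0.07 are assumptions, and the embeddings are taken as plain lists of floats that would come from CLIP's image and text encoders, which are not shown here.

```python
import math


def textualize(features, precision=2):
    """Render a feature dict as text, e.g. 'Laplacian variance: 0.85'."""
    return ", ".join(
        f"{name.replace('_', ' ').capitalize()}: {value:.{precision}f}"
        for name, value in features.items()
    )


def build_prompt(semantic_description, features):
    """Join a semantic description with the physical text (hypothetical template)."""
    return f"{semantic_description} Physical descriptors: {textualize(features)}."


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_probabilities(image_embedding, text_embeddings, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities (CLIP-style matching)."""
    logits = [
        cosine_similarity(image_embedding, text) / temperature
        for text in text_embeddings
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `build_prompt("A photo of a dog.", {"laplacian_variance": 0.85})` produces a single string that a text encoder could embed alongside the image. One practical constraint to keep in mind is that the original CLIP text encoder accepts at most 77 tokens, so long lists of descriptors may need rounding or truncation.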

## Experimental Evidence: SOTA Performance and Cross-Architecture Generalization

The method reports SOTA results on the GenImage benchmark, reaching 99.8% accuracy on the Wukong and SDv1.4 subsets. It also generalizes across architectures, maintaining stable performance on generative models not seen during training, and it is more robust to new generative architectures than purely semantic methods.

## Conclusion: Technical Significance and Application Prospects

This research offers a physically grounded basis for trustworthy AI and a new approach to cross-modal learning. It has clear deployment value in social media moderation, news authenticity verification, and digital forensics, where it can help counter deepfakes.

## Limitations and Future Directions

The current work covers static images only, and its applicability to video detection has yet to be verified. Future work will explore additional physical features (e.g., features derived from optical models), extend the approach to video detection, and develop adaptive feature selection mechanisms.
