Zing Forum

Zero-Shot Multimodal Anomaly Detection: A Training-Free Industrial Quality Inspection Solution Combining OWL-ViT and SAM

This project proposes a training-free zero-shot multimodal anomaly detection system that combines OWL-ViT v2 open-vocabulary detection and SAM pixel-level segmentation to enable natural language querying and precise localization of industrial defects like cracks, dents, and corrosion.

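Tags: zero-shot learning, multimodal, anomaly detection, vision-language models, OWL-ViT, SAM, industrial quality inspection, open vocabulary, image segmentation, defect detection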
Published 2026-05-25 02:13 | Recent activity 2026-05-25 02:19 | Estimated read 7 min

Section 01

Introduction

This project proposes a training-free zero-shot multimodal anomaly detection system that combines OWL-ViT v2 open-vocabulary detection and SAM pixel-level segmentation to enable natural language querying and precise localization of industrial defects such as cracks, dents, and corrosion. The project is maintained by AC052001, and the source code is released on GitHub (link: https://github.com/AC052001/Zero-Shot-Multimodal-Anomaly-Detection-using-Vision-Language-Models). It was published on May 24, 2026.

Section 02

Background: Pain Points of Industrial Quality Inspection and Limitations of Existing Methods

Industrial quality inspection is a core part of manufacturing, but established methods struggle. Manual inspection is slow and inconsistent. Traditional machine vision needs large annotated datasets and retraining, so it adapts poorly to new products or new defect types. Supervised anomaly detection has made progress, but it also depends on large amounts of annotated data, and anomalous samples are scarce by nature. Vision-Language Models (VLMs) offer a way around this: because they are pre-trained on large-scale image-text data, they have zero-shot and open-vocabulary capabilities.

Section 03

Method: Two-Stage Zero-Shot Detection and Segmentation Pipeline

The project uses a two-stage framework:

Stage 1: Open-Vocabulary Defect Detection

OWL-ViT v2 accepts natural language prompts (e.g., "crack", "corrosion") to detect potential anomaly regions and outputs bounding box proposals.
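
Open-vocabulary detectors usually return many overlapping, low-confidence boxes per prompt, so some filtering is typically needed before segmentation. The sketch below is one way to do that in plain Python and is not taken from the project: the `Detection` record, the `select_proposals` helper, and both thresholds are assumptions for illustration, and the detector call itself is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


@dataclass
class Detection:
    """Hypothetical record for one detector output."""
    label: str    # text prompt that matched, e.g. "crack"
    score: float  # detector confidence in [0, 1]
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two XYXY boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(dets: List[Detection], min_score: float = 0.2,
                     max_iou: float = 0.5) -> List[Detection]:
    """Drop low-confidence boxes, then greedily suppress overlapping ones.

    Suppression here is class-agnostic (an assumption): a strong "crack"
    box can suppress an overlapping "corrosion" box.
    """
    candidates = sorted((d for d in dets if d.score >= min_score),
                        key=lambda d: d.score, reverse=True)
    kept: List[Detection] = []
    for det in candidates:
        if all(iou(det.box, k.box) <= max_iou for k in kept):
            kept.append(det)
    return kept
```

Both thresholds would need tuning per product line; a low `min_score` favors recall at the cost of more false proposals handed to the segmentation stage.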

Stage 2: Pixel-Level Segmentation Refinement

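SAM uses the bounding boxes generated by OWL-ViT as prompts to produce precise segmentation masks, defect boundaries, and heatmaps. The two stages complement each other: the detector decides which regions match the text query, and SAM, which is promptable but does not assign labels itself, turns each box into a pixel-accurate outline.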
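
The hand-off between the two stages can be sketched as a small orchestration function. This is a minimal illustration, not the project's code: `run_pipeline`, `pad_box`, the `detect` and `segment` callables, and the padding `margin` are all hypothetical names and choices. In practice `detect` would wrap OWL-ViT v2 and `segment` would wrap SAM with a box prompt.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
Mask = List[List[int]]           # binary mask, mask[y][x] in {0, 1}
Detector = Callable[[object, Sequence[str]], List[Tuple[str, float, Box]]]
Segmenter = Callable[[object, Box], Mask]


def pad_box(box: Box, width: int, height: int, margin: int = 4) -> Box:
    """Grow a box by `margin` pixels and clamp it to the image bounds.

    Padding is an assumption: it gives the segmenter some context when
    the detector's box is slightly too tight around the defect.
    """
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))


def run_pipeline(image: object, size: Tuple[int, int], prompts: Sequence[str],
                 detect: Detector, segment: Segmenter,
                 margin: int = 4) -> List[Dict[str, object]]:
    """Stage 1 proposes labeled boxes; stage 2 refines each into a mask."""
    width, height = size
    results: List[Dict[str, object]] = []
    for label, score, box in detect(image, prompts):  # stage 1: detection
        prompt_box = pad_box(box, width, height, margin)
        mask = segment(image, prompt_box)             # stage 2: segmentation
        results.append({"label": label, "score": score,
                        "box": prompt_box, "mask": mask})
    return results
```

Passing the models in as callables keeps the control flow testable without loading any weights, and makes it straightforward to swap either stage for a different model.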

Section 04

Tech Stack and Implementation Details

The project is built on an open-source tech stack:

Component                  Technology
Detection model            OWL-ViT v2
Segmentation model         SAM
Deep learning framework    PyTorch
Multimodal processing      Hugging Face Transformers
Image processing           OpenCV
Visualization              Matplotlib

Every component is open source, which supports reproducibility and makes the pipeline easier to extend.

Section 05

Application Scenarios and Value

The system is applicable to multiple scenarios:

  1. Industrial Quality Inspection: Real-time detection of surface defects in production line products (e.g., metal scratches, electronic welding defects) to reduce deployment costs;
  2. Infrastructure Monitoring: Detection of bridge cracks, road potholes, pipeline corrosion, etc., to assist maintenance decisions;
  3. Smart Factory Systems: Integration with robots and automated equipment to achieve fully automated quality control.

Section 06

Analysis of Core Advantages

Compared with traditional methods, the system has significant advantages:

  1. Eliminates Annotation Costs: No annotated data is required, lowering the entry barrier;
  2. Detects Unseen Anomalies: Open-vocabulary capability supports detection of defects not seen during training;
  3. Natural Language Interaction: Users can describe defects via natural language without modifying code;
  4. Precise Pixel Segmentation: SAM outputs high-quality masks to support quantitative defect analysis;
  5. Low Deployment Overhead: No training is needed, so environment setup and a first run can be completed within hours.
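
As an illustration of the quantitative defect analysis that pixel-level masks make possible, the sketch below derives a few simple measurements from a binary mask. It is an assumption about what such an analysis might look like, not the project's implementation: the `mask_metrics` helper, the metric names, and the optional `mm_per_pixel` calibration factor are hypothetical.

```python
from typing import Dict, List, Optional

Mask = List[List[int]]  # binary mask, mask[y][x] in {0, 1}


def mask_metrics(mask: Mask,
                 mm_per_pixel: Optional[float] = None) -> Dict[str, float]:
    """Summarize a defect mask: pixel area, image fraction, and extent."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    # Rows and columns that contain at least one defect pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for x in range(width) if any(row[x] for row in mask)]
    area_px = sum(1 for row in mask for v in row if v)
    total = width * height
    metrics = {
        "area_px": float(area_px),
        "area_fraction": area_px / total if total else 0.0,
        "extent_w_px": float(xs[-1] - xs[0] + 1) if xs else 0.0,
        "extent_h_px": float(ys[-1] - ys[0] + 1) if ys else 0.0,
    }
    if mm_per_pixel is not None:
        # Physical area needs a camera calibration (assumed known here).
        metrics["area_mm2"] = area_px * mm_per_pixel ** 2
    return metrics
```

Measurements like these could feed accept/reject rules, for example rejecting a part when a defect's area fraction exceeds a tolerance chosen by the quality team.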

Section 07

Limitations and Improvement Directions

Limitations

  • Dependency on Prompt Quality: Vague descriptions may reduce detection performance;
  • Challenge with Fine Anomalies: Difficulty in reliably detecting micron-level cracks;
  • Computational Resource Requirements: Large models affect real-time performance.
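
One common way to reduce sensitivity to prompt wording is to query with several phrasings of the same defect and map every phrasing back to one canonical label. The sketch below shows that idea only; it is not described in the project, and the `build_prompts` helper, the templates, and the synonyms are hypothetical.

```python
from typing import Dict, Iterable, Sequence

# Hypothetical phrasing templates; "{}" is replaced by the defect term.
DEFAULT_TEMPLATES: Sequence[str] = ("{}", "a photo of a {}")


def build_prompts(defect: str, synonyms: Iterable[str] = (),
                  templates: Sequence[str] = DEFAULT_TEMPLATES) -> Dict[str, str]:
    """Map each prompt variant to the canonical defect label.

    Duplicates are dropped and insertion order is preserved, so the keys
    can be passed to the detector as-is and hits merged by the values.
    """
    canonical = defect.strip().lower()
    terms = [canonical] + [s.strip().lower() for s in synonyms]
    prompts: Dict[str, str] = {}
    for template in templates:
        for term in terms:
            prompts.setdefault(template.format(term), canonical)
    return prompts
```

For example, `build_prompts("crack", ["fracture"])` yields four prompt strings that all map to the label "crack", so a detection from any variant counts toward the same defect class.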

Improvement Directions

  • Real-time video anomaly detection;
  • Edge AI deployment optimization;
  • Temporal anomaly tracking;
  • Industrial Internet of Things (IIoT) integration;
  • Diffusion model-based segmentation quality refinement.

Section 08

Research Contributions and Conclusion

Research Contributions

The project demonstrates the potential of VLMs for industrial visual inspection. By combining an open-vocabulary detector with a promptable segmentation model, it achieves training-free anomaly detection and segmentation and opens a new path for industrial AI applications.

Conclusion

This project provides a practical tool for the intelligent upgrading of manufacturing. As multimodal AI matures, zero-shot and few-shot solutions are expected to spread to more industrial scenarios and to push intelligent inspection technology further.