Zing Forum

SIEVES: A Selective Prediction Method via Visual Evidence Scoring

This paper proposes the SIEVES framework, which requires reasoning models to generate localized visual evidence and learn to evaluate its quality. It increases coverage by up to 3x across 5 OOD benchmarks and can be transferred to proprietary models like o3 and Gemini-3-Pro.

Selective prediction · Visual evidence · Multimodal models · OOD generalization · Model reliability · Visual question answering · Transfer learning · Explainable AI
Published 2026-04-29 00:57 · Recent activity 2026-04-29 10:46 · Estimated read 5 min

Section 01

Introduction to the SIEVES Framework: A New Selective Prediction Method Based on Visual Evidence Scoring

Key Points of SIEVES

This paper proposes the SIEVES framework, which requires reasoning models to generate localized visual evidence and evaluate its quality. It increases coverage by up to 3x across 5 out-of-distribution (OOD) benchmarks and can be transferred to proprietary models like o3 and Gemini-3-Pro, providing a new solution for the reliable deployment of multimodal models.


Section 02

Background: Reliability Challenges of Multimodal Models and Selective Prediction

Real-World Dilemmas of Multimodal Models

Multimodal Large Language Models (MLLMs) have nearly saturated accuracy on traditional visual question answering benchmarks, but they tend to confidently output incorrect answers when facing OOD scenarios (low-quality images, rare objects, ambiguous questions, etc.). Selective prediction, which assigns confidence scores to answers and maximizes coverage under risk constraints, is a key approach to solving this problem.
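The coverage/risk trade-off described above can be sketched in a few lines. This is a generic illustration of selective prediction, not the paper's method: `best_threshold` is a hypothetical helper that, given per-example confidence scores and correctness labels, finds the confidence threshold maximizing coverage while keeping selective risk (the error rate on answered examples) under a target bound.

```python
# Minimal selective-prediction sketch (illustrative names, not from the paper):
# answer only the examples whose confidence clears a threshold, chosen so that
# the error rate among answered examples stays below max_risk.

def best_threshold(confidences, correct, max_risk=0.05):
    # Sort examples by confidence, highest first.
    paired = sorted(zip(confidences, correct), reverse=True)
    n = len(paired)
    best = None  # (coverage, threshold)
    errors = 0
    for i, (conf, ok) in enumerate(paired, start=1):
        errors += 0 if ok else 1
        risk = errors / i          # selective risk if we answer the top-i examples
        if risk <= max_risk:
            coverage = i / n       # fraction of examples we would answer
            if best is None or coverage > best[0]:
                best = (coverage, conf)
    return best  # None if no threshold satisfies the risk bound
```

For example, with scores `[0.9, 0.8, 0.7, 0.6]`, correctness `[True, True, False, True]`, and `max_risk=0.34`, every prefix stays under the bound, so the function returns full coverage at threshold 0.6.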


Section 03

Core Innovation of SIEVES: Visual Evidence-Driven Selective Prediction

Two Core Components of the SIEVES Framework

Key Insight of SIEVES: Reliable answers need to be accompanied by reliable visual evidence. The framework includes:

  1. Reasoning Model: Generates localized visual evidence pointing to relevant regions in the image (grounding capability);
  2. Selector: Evaluates the accuracy and relevance of visual evidence instead of relying solely on answer confidence.
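The two-stage decision above can be sketched as follows. Note this is a simplified stand-in: the real SIEVES selector is learned, whereas here IoU against a reference region serves as a proxy for evidence quality, and `Prediction`, `select`, and the box format are all illustrative assumptions.

```python
# Hypothetical sketch of the reasoning-model + selector pipeline:
# the model cites an image region as evidence for its answer, and the
# selector answers only when that evidence is good enough (here: IoU
# with a reference region, standing in for a learned quality score).

from dataclasses import dataclass


@dataclass
class Prediction:
    answer: str
    evidence_box: tuple  # (x1, y1, x2, y2) region cited as visual evidence


def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def select(pred: Prediction, reference_box, threshold=0.5):
    """Answer only if the cited evidence overlaps the reference region enough."""
    score = iou(pred.evidence_box, reference_box)
    return pred.answer if score >= threshold else None  # None = abstain
```

The point of the design is that the abstention signal comes from the evidence itself rather than from the model's self-reported confidence.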

Section 04

Experimental Setup: Strict OOD Benchmarks and Multi-Model Coverage

Details of Experimental Design

  • OOD Benchmarks: Covers five challenging scenarios: V*Bench (fine-grained understanding), HR-Bench-8k (high resolution), MME-RealWorld-Lite (real-world scenes), VizWiz (questions from visually impaired users), and AdVQA (adversarial VQA);
  • Model Coverage: Pixel-Reasoner (open-source), o3 (OpenAI proprietary), and Gemini-3-Pro (Google proprietary); transfer to the proprietary models requires no access to their internal weights.

Section 05

Core Results: 3x Coverage Improvement and Cross-Model Transfer Capability

Highlights of Experimental Results

  • Coverage Improvement: Compared to non-grounding baselines, SIEVES achieves up to a 3x improvement in coverage across the 5 OOD benchmarks;
  • Transfer Capability: The selector trained on Pixel-Reasoner can be directly applied to o3 and Gemini-3-Pro without additional training, leading to significant performance improvements.

Section 06

Technical Depth: Why is Visual Evidence Effective?

Value of Visual Evidence

  • Beyond Confidence: Traditional methods rely on poorly calibrated model confidence, while visual evidence provides an independent verifiable signal (e.g., accurately pointing to the image region corresponding to the answer);
  • Interpretability: When the system abstains, the reason can be understood through evidence quality, and when answering, it provides traceable basis, enhancing system auditability.
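The "independent signal" idea can be made concrete with a toy gate that requires both answer confidence and evidence quality to clear a bar before answering. This is illustrative only (the paper's learned selector is more sophisticated), and all names and thresholds here are assumptions:

```python
# Toy "AND" gate (not the paper's selector): answer only when both the
# model's confidence and an independent evidence-quality score are high.

def decide(answer, confidence, evidence_score, conf_thr=0.7, ev_thr=0.5):
    if confidence >= conf_thr and evidence_score >= ev_thr:
        return answer
    return None  # abstain; the low evidence score explains the refusal
```

A high-confidence answer with weak evidence is rejected, which is exactly the failure mode that confidence-only selection misses.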

Section 07

Practical Significance and Future Research Directions

Application Implications and Future Exploration

  • Deployment Value: Provides a reliable framework for the practical deployment of MLLMs, enhancing system credibility by "showing its work";
  • Proprietary Model Adaptation: Improves the reliability of proprietary API models without fine-tuning the underlying models;
  • Future Directions: Extend to complex reasoning tasks, causal attribution research, video/multi-image scenarios, etc.