# SIEVES: A Selective Prediction Method via Visual Evidence Scoring

> This paper proposes the SIEVES framework, which pairs a reasoning model that generates localized visual evidence with a selector that learns to evaluate its quality. It increases coverage by up to 3x across 5 OOD benchmarks and transfers to proprietary models like o3 and Gemini-3-Pro.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-28T16:57:29.000Z
- Last activity: 2026-04-29T02:46:25.425Z
- Heat: 141.2
- Keywords: selective prediction, visual evidence, multimodal models, OOD generalization, model reliability, visual question answering, transfer learning, explainable AI
- Link: https://www.zingnex.cn/en/forum/thread/sieves

---

## Introduction to the SIEVES Framework: A New Selective Prediction Method Based on Visual Evidence Scoring

### Key Points of SIEVES
This paper proposes the SIEVES framework, which pairs a reasoning model that generates localized visual evidence with a selector trained to evaluate that evidence's quality. SIEVES improves coverage by up to 3x across five out-of-distribution (OOD) benchmarks and transfers to proprietary models such as o3 and Gemini-3-Pro, offering a new path toward the reliable deployment of multimodal models.

## Background: Reliability Challenges of Multimodal Models and Selective Prediction

### Real-World Dilemmas of Multimodal Models
Multimodal Large Language Models (MLLMs) have nearly saturated accuracy on traditional visual question answering benchmarks, yet they tend to confidently output incorrect answers in OOD scenarios (low-quality images, rare objects, ambiguous questions, etc.). Selective prediction is a key approach to this problem: each answer is assigned a trust score, and the system answers only when the score clears a threshold, maximizing coverage (the fraction of questions answered) under a constraint on selective risk (the error rate among answered questions). A minimal sketch of this mechanism follows.
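
Below is a minimal, self-contained sketch of the coverage/risk trade-off described above; the function name, the example scores, and the risk budget are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the coverage/risk trade-off in selective prediction.
# The function name, scores, and risk budget are illustrative, not from the paper.
import numpy as np

def max_coverage_threshold(scores, correct, risk_budget):
    """Pick the score cutoff that maximizes coverage while keeping
    selective risk (error rate on answered questions) within budget."""
    order = np.argsort(-scores)               # most trusted answers first
    sorted_correct = np.asarray(correct)[order]
    n = len(scores)
    errors = np.cumsum(1 - sorted_correct)    # errors among the k most trusted
    risk = errors / np.arange(1, n + 1)       # selective risk at each cutoff k
    feasible = np.where(risk <= risk_budget)[0]
    if len(feasible) == 0:
        return np.inf, 0.0, 0.0               # no feasible cutoff: abstain on all
    k = feasible[-1] + 1                      # largest answer set within budget
    return scores[order][k - 1], k / n, risk[k - 1]

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])   # trust scores per answer
correct = np.array([1, 1, 0, 1, 0, 0])              # 1 = answer was right
thr, cov, risk = max_coverage_threshold(scores, correct, risk_budget=0.25)
print(f"threshold={thr:.2f}, coverage={cov:.0%}, risk={risk:.0%}")
# threshold=0.60, coverage=67%, risk=25%: answer 4 of 6 questions, abstain on 2.
```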

## Core Innovation of SIEVES: Visual Evidence-Driven Selective Prediction

### Two Core Components of the SIEVES Framework
The key insight of SIEVES is that a reliable answer should be accompanied by reliable visual evidence. The framework includes:
1. **Reasoning Model**: generates localized visual evidence that points to the relevant regions of the image (a grounding capability);
2. **Selector**: evaluates the accuracy and relevance of that visual evidence rather than relying solely on answer confidence (see the pipeline sketch after this list).
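
A hypothetical end-to-end sketch of how the two components could interact; the interfaces, the `Evidence` structure, and the 0-to-1 score range are our assumptions rather than the paper's actual API.

```python
# Hypothetical SIEVES-style pipeline; interfaces are assumed, not the paper's API.
from dataclasses import dataclass

@dataclass
class Evidence:
    box: tuple          # (x1, y1, x2, y2) image region the model points to
    rationale: str      # textual justification tied to that region

def sieves_predict(image, question, reasoner, selector, threshold=0.5):
    """Answer only when the cited visual evidence is judged trustworthy."""
    # 1. The reasoning model produces an answer plus localized evidence (grounding).
    answer, evidence = reasoner(image, question)          # -> (str, Evidence)
    # 2. The selector scores the evidence's accuracy and relevance,
    #    instead of reading only the answer's confidence.
    score = selector(image, question, answer, evidence)   # -> float in [0, 1]
    # 3. Abstain when the evidence does not support the answer.
    if score < threshold:
        return None, evidence, score    # abstention, still auditable via evidence
    return answer, evidence, score
```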

## Experimental Setup: Strict OOD Benchmarks and Multi-Model Coverage

### Details of Experimental Design
- **OOD Benchmarks**: five challenging scenarios: V*Bench (fine-grained understanding), HR-Bench-8k (high resolution), MME-RealWorld-Lite (real-world scenes), VizWiz (questions from visually impaired users), and AdVQA (adversarial VQA);
- **Model Coverage**: Pixel-Reasoner (open-source), o3 (OpenAI, proprietary), and Gemini-3-Pro (Google, proprietary); transfer to the proprietary models requires no access to internal weights, as sketched below.
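
Weight-free transfer is plausible precisely because the selector reads only model outputs. A sketch of wrapping a proprietary API under that assumption (the `api_call` contract here, returning an answer plus cited evidence, is hypothetical):

```python
# Sketch of black-box transfer: the selector sees only (image, question, answer,
# evidence), so it can wrap a proprietary API with no weights or logits needed.
def wrap_api_model(api_call, selector, threshold=0.5):
    """Turn any grounded VQA API into a selective predictor."""
    def selective(image, question):
        # Assumes the API can be prompted to return its answer together with
        # the evidence region(s) it cites (hypothetical contract).
        answer, evidence = api_call(image, question)
        score = selector(image, question, answer, evidence)
        return (answer if score >= threshold else None), score
    return selective
```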

## Core Results: 3x Coverage Improvement and Cross-Model Transfer Capability

### Highlights of Experimental Results
- **Coverage Improvement**: up to a 3x gain in coverage over non-grounding baselines across the five OOD benchmarks;
- **Transfer Capability**: the selector trained on Pixel-Reasoner applies directly to o3 and Gemini-3-Pro without additional training, yielding significant performance improvements.

## Technical Depth: Why is Visual Evidence Effective?

### Value of Visual Evidence
- **Beyond Confidence**: traditional methods rely on poorly calibrated model confidence, whereas visual evidence provides an independent, verifiable signal (e.g., whether the model points to the image region that actually corresponds to the answer);
- **Interpretability**: when the system abstains, the evidence quality explains why; when it answers, the cited region provides a traceable basis, making the system auditable. One concrete, checkable signal is sketched below.
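
As one illustration of a verifiable signal (our example, not necessarily the paper's metric): the overlap between the region the model cites and a reference region for the question can be checked directly, unlike a raw confidence score.

```python
# Intersection-over-union: a spatially checkable evidence signal (our example).
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# High overlap means the evidence "points" where the answer claims;
# an answer-only confidence score carries no such verifiable geometry.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```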

## Practical Significance and Future Research Directions

### Application Implications and Future Exploration
- **Deployment Value**: provides a reliable framework for the practical deployment of MLLMs, building trust by having the model "show its work";
- **Proprietary Model Adaptation**: improves the reliability of proprietary API models without fine-tuning the underlying models;
- **Future Directions**: extending to complex reasoning tasks, causal attribution, and video/multi-image scenarios.
