Zing Forum

SIEVES: A Selective Prediction Method via Visual Evidence Scoring

This paper proposes the SIEVES framework, which requires reasoning models to generate localized visual evidence and learn to evaluate its quality. It increases coverage by up to 3x across 5 OOD benchmarks and can be transferred to proprietary models like o3 and Gemini-3-Pro.

Selective prediction · Visual evidence · Multimodal models · OOD generalization · Model reliability · Visual question answering · Transfer learning · Explainable AI
Published 2026-04-29 00:57 · Recent activity 2026-04-29 10:46 · Estimated read 5 min

Section 01

Introduction to the SIEVES Framework: A New Selective Prediction Method Based on Visual Evidence Scoring

Key Points of SIEVES

This paper proposes the SIEVES framework, which requires reasoning models to generate localized visual evidence and evaluate its quality. It increases coverage by up to 3x across 5 out-of-distribution (OOD) benchmarks and can be transferred to proprietary models like o3 and Gemini-3-Pro, providing a new solution for the reliable deployment of multimodal models.


Section 02

Background: Reliability Challenges of Multimodal Models and Selective Prediction

Real-World Dilemmas of Multimodal Models

Multimodal Large Language Models (MLLMs) have nearly saturated accuracy on traditional visual question answering benchmarks, but they tend to confidently output incorrect answers when facing OOD scenarios (low-quality images, rare objects, ambiguous questions, etc.). Selective prediction, which assigns confidence scores to answers and maximizes coverage under risk constraints, is a key approach to solving this problem.
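The coverage/risk trade-off described above can be sketched in a few lines. This is a generic illustration of selective prediction, not the paper's method: `best_threshold` is a hypothetical helper that, given per-example confidence scores and correctness labels, finds the confidence threshold maximizing coverage while keeping selective risk (the error rate on answered examples) under a target bound.

```python
# Minimal selective-prediction sketch (illustrative names, not from the paper):
# answer only the examples whose confidence clears a threshold, chosen so that
# the error rate among answered examples stays below max_risk.

def best_threshold(confidences, correct, max_risk=0.05):
    # Sort examples by confidence, highest first.
    paired = sorted(zip(confidences, correct), reverse=True)
    n = len(paired)
    best = None  # (coverage, threshold)
    errors = 0
    for i, (conf, ok) in enumerate(paired, start=1):
        errors += 0 if ok else 1
        risk = errors / i          # selective risk if we answer the top-i examples
        if risk <= max_risk:
            coverage = i / n       # fraction of examples we would answer
            if best is None or coverage > best[0]:
                best = (coverage, conf)
    return best  # None if no threshold satisfies the risk bound
```

For example, with scores `[0.9, 0.8, 0.7, 0.6]`, correctness `[True, True, False, True]`, and `max_risk=0.34`, every prefix stays under the bound, so the function returns full coverage at threshold 0.6.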


Section 03

Core Innovation of SIEVES: Visual Evidence-Driven Selective Prediction

Two Core Components of the SIEVES Framework

Key Insight of SIEVES: Reliable answers need to be accompanied by reliable visual evidence. The framework includes:

  1. Reasoning Model: Generates localized visual evidence pointing to relevant regions in the image (grounding capability);
  2. Selector: Evaluates the accuracy and relevance of visual evidence instead of relying solely on answer confidence.
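The two-stage decision above can be sketched as follows. Note this is a simplified stand-in: the real SIEVES selector is learned, whereas here IoU against a reference region serves as a proxy for evidence quality, and `Prediction`, `select`, and the box format are all illustrative assumptions.

```python
# Hypothetical sketch of the reasoning-model + selector pipeline:
# the model cites an image region as evidence for its answer, and the
# selector answers only when that evidence is good enough (here: IoU
# with a reference region, standing in for a learned quality score).

from dataclasses import dataclass


@dataclass
class Prediction:
    answer: str
    evidence_box: tuple  # (x1, y1, x2, y2) region cited as visual evidence


def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def select(pred: Prediction, reference_box, threshold=0.5):
    """Answer only if the cited evidence overlaps the reference region enough."""
    score = iou(pred.evidence_box, reference_box)
    return pred.answer if score >= threshold else None  # None = abstain
```

The point of the design is that the abstention signal comes from the evidence itself rather than from the model's self-reported confidence.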

Section 04

Experimental Setup: Strict OOD Benchmarks and Multi-Model Coverage

Details of Experimental Design

  • OOD Benchmarks: Covers five challenging scenarios: V*Bench (fine-grained understanding), HR-Bench-8k (high resolution), MME-RealWorld-Lite (real-world scenes), VizWiz (questions from visually impaired users), and AdVQA (adversarial VQA);
  • Model Coverage: Pixel-Reasoner (open-source), o3 (OpenAI proprietary), and Gemini-3-Pro (Google proprietary); transfer to the proprietary models requires no access to their internal weights.

Section 05

Core Results: 3x Coverage Improvement and Cross-Model Transfer Capability

Highlights of Experimental Results

  • Coverage Improvement: Compared to non-grounding baselines, SIEVES achieves up to a 3x improvement in coverage across the 5 OOD benchmarks;
  • Transfer Capability: The selector trained on Pixel-Reasoner can be directly applied to o3 and Gemini-3-Pro without additional training, leading to significant performance improvements.

Section 06

Technical Depth: Why is Visual Evidence Effective?

Value of Visual Evidence

  • Beyond Confidence: Traditional methods rely on poorly calibrated model confidence, while visual evidence provides an independent verifiable signal (e.g., accurately pointing to the image region corresponding to the answer);
  • Interpretability: When the system abstains, the reason can be understood through evidence quality, and when answering, it provides traceable basis, enhancing system auditability.
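The "independent signal" idea can be made concrete with a toy gate that requires both answer confidence and evidence quality to clear a bar before answering. This is illustrative only (the paper's learned selector is more sophisticated), and all names and thresholds here are assumptions:

```python
# Toy "AND" gate (not the paper's selector): answer only when both the
# model's confidence and an independent evidence-quality score are high.

def decide(answer, confidence, evidence_score, conf_thr=0.7, ev_thr=0.5):
    if confidence >= conf_thr and evidence_score >= ev_thr:
        return answer
    return None  # abstain; the low evidence score explains the refusal
```

A high-confidence answer with weak evidence is rejected, which is exactly the failure mode that confidence-only selection misses.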

Section 07

Practical Significance and Future Research Directions

Application Implications and Future Exploration

  • Deployment Value: Provides a reliable framework for the practical deployment of MLLMs, enhancing system credibility by "showing its work";
  • Proprietary Model Adaptation: Improves the reliability of proprietary API models without fine-tuning the underlying models;
  • Future Directions: Extend to complex reasoning tasks, causal attribution research, video/multi-image scenarios, etc.