Zing Forum

Reading

Sherlock: A Self-Correcting Reasoning Framework for Vision-Language Models

The open-source implementation of Sherlock, a NeurIPS 2025 accepted paper, is released. It is the first framework to enable intrinsic self-correcting capabilities in vision-language models (VLMs), achieving significant improvements on multiple benchmarks with only 20K samples.

Sherlock视觉语言模型自我纠错推理NeurIPS 2025VLMLLaVA-CoT多阶段训练自我改进
Published 2026-06-04 14:55Recent activity 2026-06-04 15:24Estimated read 6 min
Sherlock: A Self-Correcting Reasoning Framework for Vision-Language Models
1

Section 01

[Introduction] Sherlock: Open-Source Self-Correcting Reasoning Framework for VLMs (NeurIPS 2025 Accepted)

Sherlock is the first framework to enable intrinsic self-correcting capabilities in vision-language models (VLMs). Its paper has been accepted by NeurIPS 2025 and open-sourced. The framework achieves significant improvements on multiple benchmarks with only 20K samples. Author: DripNowhy. Project repository link: https://github.com/DripNowhy/Sherlock. Paper link: http://arxiv.org/abs/2505.22651. Released on June 4, 2026.

2

Section 02

Background: Core Bottleneck of VLMs' Reasoning Ability—Lack of Self-Correction

Vision-language models have made rapid progress in tasks like image understanding and visual question answering, but they face significant bottlenecks in complex reasoning: existing models trained via SFT or RL struggle to perform step-by-step or holistic self-correction. The root cause lies in the training paradigm focusing on single-inference accuracy while ignoring the model's self-verification and correction capabilities. How to enable VLMs to have human-like "reflection-correction" abilities is a key challenge.

3

Section 03

Sherlock Framework: Multi-Stage Training Scheme for Intrinsic Self-Correction

The core innovations of the Sherlock framework include: 1. Intrinsic self-correction mechanism without external prompts/multiple sampling; 2. Data efficiency with only 20K samples; 3. Three-stage training process (supervised fine-tuning → offline self-improvement → online self-improvement); 4. Cross-benchmark generalization ability. Built on Llama3.2-Vision-11B-Instruct:

  • Stage 1: Supervised fine-tuning with 20K LLaVA-CoT samples to master basic reasoning abilities;
  • Stage 2: Offline training with self-generated data to learn error recovery strategies;
  • Stage 3: Online construction of 5K preference data using questions and images to dynamically adjust reasoning strategies.
4

Section 04

Experimental Evidence: Multi-Benchmark Performance Breaks Through Self-Correction Bottleneck

Using the VLMEvalKit evaluation framework, covering tasks like mathematical reasoning, scientific question answering, and visual common sense reasoning. Key findings: Existing VLMs generally lack self-correction capabilities; Sherlock breaks through this bottleneck via its innovative framework and achieves significant improvements. Model weights have been released: https://huggingface.co/collections/Tuwhy/sherlock-6835f46e450a48f228f7e80d.

5

Section 05

Usage Guide: Environment Preparation and Code Implementation

Implemented with modifications based on LLaMA-Factory and VLMEvalKit:

6

Section 06

Project Impact: Academic Recognition and Open-Source Contribution

The Sherlock paper was accepted by NeurIPS 2025, marking the important academic value of self-correcting reasoning in the VLMs field. The project is fully open-sourced, including training/evaluation code, data construction processes, model weights, and documentation, providing support for community research. Inspiring directions: self-correction mechanism design, data efficiency optimization, and vision-language fusion reasoning.

7

Section 07

Summary: Sherlock's Breakthroughs and Future Value

As the first intrinsic self-correcting VLMs framework, Sherlock achieves performance improvements with a small number of samples through multi-stage training, proving the key role of self-correction in reasoning abilities. Open-source resources will promote the development of VLMs towards more intelligent and reliable reasoning systems, providing a new paradigm for research in related fields.