# Sherlock: A Self-Correcting Reasoning Framework for Vision-Language Models

> The open-source implementation of Sherlock, a NeurIPS 2025 accepted paper, is released. It is the first framework to enable intrinsic self-correcting capabilities in vision-language models (VLMs), achieving significant improvements on multiple benchmarks with only 20K samples.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-04T06:55:20.000Z
- 最近活动: 2026-06-04T07:24:23.438Z
- 热度: 152.5
- 关键词: Sherlock, 视觉语言模型, 自我纠错, 推理, NeurIPS 2025, VLM, LLaVA-CoT, 多阶段训练, 自我改进
- 页面链接: https://www.zingnex.cn/en/forum/thread/sherlock
- Canonical: https://www.zingnex.cn/forum/thread/sherlock
- Markdown 来源: floors_fallback

---

## [Introduction] Sherlock: Open-Source Self-Correcting Reasoning Framework for VLMs (NeurIPS 2025 Accepted)

Sherlock is the first framework to enable intrinsic self-correcting capabilities in vision-language models (VLMs). Its paper has been accepted by NeurIPS 2025 and open-sourced. The framework achieves significant improvements on multiple benchmarks with only 20K samples. Author: DripNowhy. Project repository link: https://github.com/DripNowhy/Sherlock. Paper link: http://arxiv.org/abs/2505.22651. Released on June 4, 2026.

## Background: Core Bottleneck of VLMs' Reasoning Ability—Lack of Self-Correction

Vision-language models have made rapid progress in tasks like image understanding and visual question answering, but they face significant bottlenecks in complex reasoning: existing models trained via SFT or RL struggle to perform step-by-step or holistic self-correction. The root cause lies in the training paradigm focusing on single-inference accuracy while ignoring the model's self-verification and correction capabilities. How to enable VLMs to have human-like "reflection-correction" abilities is a key challenge.

## Sherlock Framework: Multi-Stage Training Scheme for Intrinsic Self-Correction

The core innovations of the Sherlock framework include: 1. Intrinsic self-correction mechanism without external prompts/multiple sampling; 2. Data efficiency with only 20K samples; 3. Three-stage training process (supervised fine-tuning → offline self-improvement → online self-improvement); 4. Cross-benchmark generalization ability. Built on Llama3.2-Vision-11B-Instruct:
- Stage 1: Supervised fine-tuning with 20K LLaVA-CoT samples to master basic reasoning abilities;
- Stage 2: Offline training with self-generated data to learn error recovery strategies;
- Stage 3: Online construction of 5K preference data using questions and images to dynamically adjust reasoning strategies.

## Experimental Evidence: Multi-Benchmark Performance Breaks Through Self-Correction Bottleneck

Using the VLMEvalKit evaluation framework, covering tasks like mathematical reasoning, scientific question answering, and visual common sense reasoning. Key findings: Existing VLMs generally lack self-correction capabilities; Sherlock breaks through this bottleneck via its innovative framework and achieves significant improvements. Model weights have been released: https://huggingface.co/collections/Tuwhy/sherlock-6835f46e450a48f228f7e80d.

## Usage Guide: Environment Preparation and Code Implementation

Implemented with modifications based on LLaMA-Factory and VLMEvalKit:
- Base model: Download Llama3.2-Vision-11B-Instruct (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct);
- Training data: LLaVA-CoT dataset (https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k). Use 20K samples for SFT/offline stages, and construct 5K preference data using questions + images for the online stage;
- Inference example: Load Sherlock Iter2 weights (Tuwhy/Sherlock-Iter2) and implement inference via the transformers library;
- For training and evaluation guidelines, refer to train/README.md and inference/README.md in the project repository.

## Project Impact: Academic Recognition and Open-Source Contribution

The Sherlock paper was accepted by NeurIPS 2025, marking the important academic value of self-correcting reasoning in the VLMs field. The project is fully open-sourced, including training/evaluation code, data construction processes, model weights, and documentation, providing support for community research. Inspiring directions: self-correction mechanism design, data efficiency optimization, and vision-language fusion reasoning.

## Summary: Sherlock's Breakthroughs and Future Value

As the first intrinsic self-correcting VLMs framework, Sherlock achieves performance improvements with a small number of samples through multi-stage training, proving the key role of self-correction in reasoning abilities. Open-source resources will promote the development of VLMs towards more intelligent and reliable reasoning systems, providing a new paradigm for research in related fields.
