
Research on Training and Interpretability of Multimodal Reasoning Models: Inference Circuit Identification Using GRPO and Sparse Autoencoders

This project explores how to train small multimodal reasoning models and uses sparse autoencoders to identify their internal inference circuits, providing new insights for understanding the reasoning mechanisms of large multimodal models.

Tags: multimodal reasoning models, GRPO, sparse autoencoders, interpretability, chain of thought, reinforcement learning, Qwen

Published 2026-05-19 15:35 | Recent activity 2026-05-19 15:48 | Estimated read 6 min


Section 01

[Project Introduction] Core Overview of Research on Training and Interpretability of Multimodal Reasoning Models

This project focuses on training and interpreting multimodal reasoning models. It fine-tunes the Qwen3.5-4B model with the Group Relative Policy Optimization (GRPO) algorithm so that the model generates an explicit chain of thought, and it plans to use sparse autoencoders to identify the model's internal inference circuits. The aim is to open the "black box" of large multimodal models and offer new insight into how they reason. So far, the baseline evaluation experiments have been completed; they point to the potential of GRPO and show how strongly evaluation design affects measured results.

Section 02

Project Background and Research Motivation

Large multimodal language models (MLLMs) perform well on tasks such as visual question answering and image-text understanding, but their internal reasoning remains a black box. Conventional training methods improve performance without explaining how the model arrives at its answers. This project, "multimodal-reasoning-interp", combines reinforcement learning training with interpretability analysis to address that gap.

Section 03

Analysis of Core Technical Route

The project follows a two-stage route: 1. Fine-tune the Qwen3.5-4B model with the GRPO algorithm so that it generates an explicit chain of thought for image inputs; 2. After training, use sparse autoencoders to analyze the model's internal activations and identify the neural circuits related to reasoning. GRPO updates the policy using the relative rewards of the samples within a group, so it does not need a separately trained value function. This makes it lighter than PPO and, according to the project, more stable and efficient, which suits small-scale experiments.
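The group-relative reward idea can be illustrated with a minimal sketch. It is not the project's training code: the function name, the reward values, and the choice of population standard deviation are assumptions made for illustration. Each sampled answer to the same question is scored, and its advantage is its reward normalized against the group's own mean and spread, which is what removes the need for a value function.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. Using the population std and this `eps` is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one image question:
# 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct samples get a positive advantage, incorrect ones a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

When every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason reward design and group size matter in practice.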

Section 04

Baseline Evaluation Experiments and Key Findings

The project completed its baseline evaluation in the first week, running two comparative experiments on 50 multimodal questions. In v1, a 1024-token output limit and strict format matching held overall accuracy to 34% (0% on floating-point answers). In v2, raising the limit to 2048 tokens and adding an intelligent answer extraction mechanism lifted overall accuracy to 66% (100% on floating-point answers, 75% on integers, and 59% on text). The comparison indicates that sufficient output space and a robust parsing mechanism are crucial for multi-step reasoning tasks.
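The source does not describe how the answer extraction mechanism works, so the following is only a sketch of what a tolerant extractor might look like. The `<answer>` tag convention, the fallback order, and the 1% numeric tolerance are all assumptions, and both function names are hypothetical.

```python
import re


def extract_answer(text):
    """Return the final answer string from a model response, or None."""
    # 1. Prefer an explicit tag if the model followed the format.
    tagged = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if tagged:
        return tagged[-1].strip()
    # 2. Fall back to an "answer is ..." or "Answer: ..." phrase.
    labeled = re.findall(r"answer\s*(?:is|:)\s*(.+)", text, flags=re.IGNORECASE)
    if labeled:
        return labeled[-1].strip().rstrip(".")
    # 3. Last resort: take the last number that appears in the text.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def is_correct(pred, gold, rel_tol=1e-2):
    """Compare numerically when both sides parse as floats, else as text."""
    if pred is None:
        return False
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()
    return abs(p - g) <= rel_tol * max(1.0, abs(g))


print(extract_answer("Step 1 ... <answer> 42 </answer>"))
print(extract_answer("So the answer is 3.14."))
print(is_correct("3.14", "3.1416"))
```

A strict exact-match check would mark "3.14" wrong against a reference of "3.1416", which is the kind of failure that can drive accuracy on floating-point answers to zero regardless of the model's reasoning.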

Section 05

Sparse Autoencoders and Inference Circuit Identification Plan

The project plans to use sparse autoencoders, an unsupervised method that learns an overcomplete dictionary and reconstructs its inputs under a sparsity constraint, to analyze the activations of the model's middle layers. The goal is to identify the neural circuits responsible for functions such as "visual feature extraction", "numerical calculation", and "logical inference". Subsequent ablation experiments will test what these circuits do and establish a causal link between internal mechanisms and external behavior.
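To make the sparse autoencoder idea concrete, here is a small forward-pass sketch in plain Python. It is not the project's implementation, which has not been published at this stage: the class name, dimensions, initialization, and L1 coefficient are assumptions, and real experiments would use PyTorch tensors and gradient-based training. It shows the two ingredients named above: a dictionary wider than the input (overcomplete) and a loss that adds a sparsity penalty to the reconstruction error.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Illustrative sparse autoencoder: ReLU encoder, linear decoder."""

    def __init__(self, d_model, d_dict, seed=0):
        rng = random.Random(seed)
        # d_dict > d_model makes the dictionary overcomplete.
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        pre = matvec(self.w_enc, x)
        # ReLU keeps feature activations non-negative.
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Reconstruction MSE plus an L1 penalty on feature activations."""
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(abs(f) for f in features)
        return mse + l1_coeff * l1, features


# Hypothetical 4-dimensional activation vector and a 16-feature dictionary.
sae = TinySAE(d_model=4, d_dict=16)
total_loss, features = sae.loss([0.5, -1.0, 0.25, 2.0])
print(len(features), total_loss >= 0.0)
```

In an ablation experiment of the kind described above, one would set a chosen feature's activation to zero before decoding and measure how the model's downstream output changes.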

Section 06

Technical Implementation and Reproduction Guide

The project targets Python 3.11 and uses uv for environment and dependency management. Key dependencies include PyTorch, Transformers, Datasets, Accelerate, PEFT, BitsAndBytes, and TRL, along with Weights & Biases for experiment tracking. Reproduction steps: clone the repository → create a virtual environment → install dependencies → launch the experiment. The modules are kept separate so that each can be used on its own.

Section 07

Research Significance and Future Outlook

This project builds a closed loop of "training-analysis-understanding": it aims to improve reasoning capability while also understanding the internal mechanisms behind it, which matters for building trustworthy and interpretable AI systems. The project is still at an early stage (Phase 1) and the sparse autoencoder analysis has not yet been completed, but the results so far show the potential of GRPO and the impact of evaluation design. Future work will carry out in-depth interpretability analysis to contribute to the understanding of how large multimodal models work.
