Zing Forum

Research on Training and Interpretability of Multimodal Reasoning Models: Inference Circuit Identification Using GRPO and Sparse Autoencoders

This project explores how to train small multimodal reasoning models and uses sparse autoencoders to identify their internal inference circuits, providing new insights for understanding the reasoning mechanisms of large multimodal models.

Tags: multimodal reasoning models, GRPO, sparse autoencoders, interpretability, chain of thought, reinforcement learning, Qwen
Published 2026-05-19 15:35 | Recent activity 2026-05-19 15:48 | Estimated read 6 min

Section 01

[Project Introduction] Core Overview of Research on Training and Interpretability of Multimodal Reasoning Models

This project focuses on the training and interpretability of multimodal reasoning models. It explores fine-tuning the Qwen3.5-4B model with the Group Relative Policy Optimization (GRPO) algorithm so that the model generates an explicit chain of thought, and it plans to use sparse autoencoders to identify the model's internal inference circuits. The aim is to open the "black box" of large multimodal models and provide new insights into their reasoning mechanisms. So far, the baseline evaluation experiments have been completed. According to the project, they point to the potential of GRPO and show that evaluation design has a critical impact on measured results.

Section 02

Project Background and Research Motivation

Large multimodal language models (MLLMs) perform well on tasks such as visual question answering and image-text understanding, but their internal reasoning mechanisms remain a black box, which increasingly concerns researchers. Conventional training methods improve performance without explaining how the model arrives at its answers. This project, "multimodal-reasoning-interp", combines reinforcement learning training with interpretability analysis in an attempt to address this core problem.

Section 03

Analysis of Core Technical Route

The project combines two complementary steps: 1. Fine-tune the Qwen3.5-4B model with the GRPO algorithm so that it generates an explicit chain of thought for image inputs; 2. After training, use sparse autoencoders to analyze the model's internal activations and identify neural circuits related to reasoning. Unlike PPO, GRPO does not train a separate value function. It samples a group of responses for each prompt and updates the policy using each response's reward relative to the rest of its group, which lowers memory and compute costs and makes the method suitable for small-scale experiments. The project also describes GRPO as more stable than PPO.
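
The group-relative reward idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's training code: the function name `group_relative_advantages`, the `eps` constant, and the example rewards are assumptions, and the normalization by group mean and standard deviation follows the commonly described GRPO formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response relative to the other responses in its group.

    `rewards` holds one scalar reward per response sampled for the SAME prompt.
    No learned value function is involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four responses to one prompt, reward 1.0 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct responses get a positive advantage, incorrect ones a negative one.
```

If every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason reward and evaluation design matter for this kind of training.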

Section 04

Baseline Evaluation Experiments and Key Findings

The project completed its baseline evaluation in the first week, running two comparative evaluation setups on 50 multimodal questions. The v1 run reached an overall accuracy of 34% (0% on floating-point answers), held back by the 1024-token output limit and strict answer formatting. The v2 run raised the token limit to 2048 and added an "intelligent" answer extraction mechanism, and overall accuracy rose to 66% (100% on floating-point answers, 75% on integer answers, and 59% on text answers). The gap between the two runs suggests that sufficient output space and robust answer parsing are crucial when evaluating multi-step reasoning tasks.
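
The source does not describe how the v2 answer extraction works, so the following is only a sketch of what a tolerant parser might look like. The `<answer>` tag, the "Answer:" line convention, the fallback to the last number, and the names `extract_answer` and `answers_match` are all assumptions made for illustration.

```python
import math
import re
from typing import Optional


def extract_answer(output: str) -> Optional[str]:
    """Return the most likely final answer in a model output, or None."""
    # 1. Prefer an explicit tag (hypothetical format): <answer>...</answer>
    tagged = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL | re.IGNORECASE)
    if tagged:
        return tagged.group(1).strip()
    # 2. Otherwise use the last line that looks like "Answer: ..." or "Final answer = ..."
    lines = re.findall(r"(?im)^\s*(?:final\s+)?answer\s*[:=]\s*(.+?)\s*$", output)
    if lines:
        return lines[-1].strip()
    # 3. Last resort: the last number that appears anywhere in the output
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return numbers[-1] if numbers else None


def answers_match(pred: Optional[str], gold: str, abs_tol: float = 1e-2) -> bool:
    """Compare numerically with a tolerance when possible, else as normalized text."""
    if pred is None:
        return False
    try:
        return math.isclose(float(pred.rstrip(".")), float(gold), rel_tol=0.0, abs_tol=abs_tol)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()
```

A strict exact-match check would mark "3.50" wrong against a reference of "3.5". A tolerance-based comparison like the one above avoids that kind of failure, which is one plausible explanation for a 0% floating-point score under strict formatting.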

Section 05

Sparse Autoencoders and Inference Circuit Identification Plan

The project plans to use sparse autoencoders to analyze activations from the model's middle layers. A sparse autoencoder is an unsupervised model that learns an overcomplete dictionary of features and reconstructs its input under a sparsity constraint, so that only a few features are active for any given input. The goal is to identify neural circuits responsible for functions such as "visual feature extraction", "numerical calculation", and "logical inference". Follow-up ablation experiments are intended to verify what these circuits do and to establish a causal link between internal mechanisms and external behavior.
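
To make the terms concrete, here is a dependency-free toy sketch of a sparse autoencoder's forward pass, its loss, and a feature ablation. It is not the project's implementation, which is not yet available and would more likely use PyTorch tensors. The ReLU encoder, the L1 sparsity penalty, the hand-picked weights, and all function names are assumptions chosen for illustration.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_encode(x, W_enc, b_enc):
    """Feature activations: ReLU(W_enc @ x + b_enc)."""
    return [max(0.0, p + b) for p, b in zip(matvec(W_enc, x), b_enc)]


def sae_decode(f, W_dec, b_dec):
    """Reconstruction: W_dec @ f + b_dec."""
    return [r + b for r, b in zip(matvec(W_dec, f), b_dec)]


def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(abs(a) for a in f)


def ablate(f, index):
    """Zero out one feature, as an ablation experiment would."""
    return [0.0 if i == index else v for i, v in enumerate(f)]


# Toy overcomplete dictionary: 4 features for a 2-dimensional activation.
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [2.0, -3.0]
f = sae_encode(x, W_enc, b_enc)        # only 2 of the 4 features are active
x_hat = sae_decode(f, W_dec, b_dec)    # reconstructs x exactly in this toy case
x_ablated = sae_decode(ablate(f, 0), W_dec, b_dec)  # first coordinate is lost
```

In a real analysis the weights would be learned by minimizing the loss over many recorded activations, and the effect of an ablation would be measured on the model's downstream behavior rather than on the reconstruction alone.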

Section 06

Technical Implementation and Reproduction Guide

The project targets Python 3.11 and uses uv for environment and dependency management. Key dependencies include PyTorch, Transformers, Datasets, Accelerate, PEFT, BitsAndBytes, and TRL, along with Weights & Biases for experiment tracking. Reproduction steps: Clone the repository → Create a virtual environment → Install dependencies → Launch the experiment. The modules are kept separate so that each can be used on its own.
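
After installing dependencies, a quick sanity check like the one below can confirm that the listed libraries are importable. It is a hypothetical helper, not part of the repository. The module names are the usual import names for the packages mentioned above, and `missing_modules` is a name introduced here for illustration.

```python
from importlib.util import find_spec

# Usual import names for the dependencies listed above (an assumption about
# the project's exact requirements, which the source does not enumerate).
REQUIRED_MODULES = [
    "torch", "transformers", "datasets", "accelerate",
    "peft", "bitsandbytes", "trl", "wandb",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```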

Section 07

Research Significance and Future Outlook

This project builds a closed loop of "training-analysis-understanding": it aims to improve reasoning capabilities while also explaining the internal mechanisms behind them, which matters for building trustworthy and interpretable AI systems. The project is still at an early stage (Phase 1) and the sparse autoencoder analysis has not yet been completed, but the existing results already point to the potential of GRPO and show how much evaluation design affects measured performance. The next step is in-depth interpretability analysis, with the goal of contributing insights into the mechanisms of large multimodal models.