Section 01
[Project Introduction] Overview of Research on Training and Interpreting Multimodal Reasoning Models
This project studies the training and interpretability of multimodal reasoning models. It explores fine-tuning the Qwen3.5-4B model with the Group Relative Policy Optimization (GRPO) algorithm so that the model generates explicit chains of thought, and it plans to use sparse autoencoders to identify the model's internal reasoning circuits. The aim is to open the "black box" of large multimodal models and offer new insight into their reasoning mechanisms. So far, the baseline evaluation experiments have been completed; they support the potential of GRPO and show that evaluation design has a critical impact on results.
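To make the "group relative" part of GRPO concrete, here is a minimal sketch of how advantages are commonly computed: several responses are sampled for one prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The function name, the example rewards, the `eps` stabilizer, and the use of the population standard deviation are illustrative assumptions, not details taken from this project.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps to
    avoid division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt
# (for example, 1.0 if the final answer is correct, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separately learned value model is needed for the baseline. When all rewards in a group are identical, every advantage is zero and that group contributes no learning signal.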