Zing Forum


PRISM: A Black-box Policy Distillation Pre-alignment Method for Multimodal Reinforcement Learning

PRISM is a three-stage training process that mitigates distribution drift by inserting an explicit distribution alignment phase between SFT and RLVR. It uses a MoE discriminator to provide decoupled perception and reasoning correction signals, achieving significant performance improvements on Qwen3-VL.

Multimodal Reinforcement Learning · Policy Distillation · PRISM · Distribution Alignment · SFT · RLVR · Qwen3-VL
Published 2026-05-01 01:12 · Recent activity 2026-05-01 10:31 · Estimated read 7 min

Section 01

PRISM: Guide to the Black-box Policy Distillation Pre-alignment Method in Multimodal Reinforcement Learning

PRISM is a three-stage training process proposed to address the distribution drift problem in multimodal reinforcement learning. Its core idea is to insert an explicit distribution alignment phase between SFT (Supervised Fine-Tuning) and RLVR (Reinforcement Learning with Verifiable Rewards), using a MoE (Mixture of Experts) discriminator to provide decoupled correction signals for perception and reasoning. The method achieves significant performance improvements on the Qwen3-VL model and offers a new paradigm for optimizing multimodal training pipelines.
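The three-stage schedule described above can be sketched as follows. This is a minimal runnable toy, not the authors' code: the stage functions only record which stage ran, and all names (`sft`, `align`, `rlvr`, `train_prism`) are hypothetical placeholders.

```python
# Toy sketch of the PRISM three-stage schedule: SFT -> distribution
# alignment -> RLVR. Stage functions are placeholders that just record
# the training order; they do NOT implement the paper's algorithms.

def sft(model, data):
    model = dict(model)
    model["stages"] = model.get("stages", []) + ["sft"]
    return model

def align(model, data, discriminator=None):
    # Black-box distillation: only teacher *outputs* in `data` are needed,
    # never teacher logits. `discriminator` would supply the decoupled
    # perception/reasoning correction signals.
    model = dict(model)
    model["stages"] = model.get("stages", []) + ["align"]
    return model

def rlvr(model, data):
    model = dict(model)
    model["stages"] = model.get("stages", []) + ["rlvr"]
    return model

def train_prism(model, sft_data, align_data, rl_data):
    model = sft(model, sft_data)            # Stage 1: SFT initialization
    model = align(model, align_data)        # Stage 2: PRISM's added phase
    model = rlvr(model, rl_data)            # Stage 3: RLVR optimization
    return model

trained = train_prism({}, sft_data=[], align_data=[], rl_data=[])
print(trained["stages"])  # ['sft', 'align', 'rlvr']
```

The point of the sketch is the ordering: the alignment phase sits strictly between SFT and RLVR, so the RL stage starts from a distribution-corrected policy.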


Section 02

Distribution Drift Dilemma in Multimodal Model Training

The traditional SFT→RLVR training pipeline for Large Multimodal Models (LMMs) has a fundamental problem: distribution drift. 1. Dual drift in SFT: capability forgetting (losing pre-trained general knowledge) and supervision distribution mismatch (the model's outputs deviate from the reference answers); 2. Multimodal compound drift: perception errors (image understanding) and reasoning errors (logic) follow different drift patterns, and the two compound each other during the RL phase, leading to unstable optimization.
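Distribution drift of the kind described above is commonly quantified with a divergence between the pre-trained (reference) output distribution and the post-SFT one; a standard choice is KL divergence. The distributions below are made-up numbers purely for illustration.

```python
import math

# Toy illustration of measuring distribution drift: KL divergence between
# a hypothetical pre-trained token distribution and a post-SFT one.
# These probabilities are invented for the example, not from the paper.

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.50, 0.30, 0.20]   # hypothetical pre-trained distribution
after_sft = [0.80, 0.15, 0.05]   # hypothetical post-SFT distribution

drift = kl_divergence(after_sft, reference)
print(f"KL drift after SFT: {drift:.3f} nats")  # ~0.203 nats
```

A larger KL value means the fine-tuned policy has moved further from the reference; an explicit alignment phase aims to pull this gap back down before RL begins.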


Section 03

PRISM's Three-Stage Training Paradigm and Data Strategy

PRISM's three-stage process: 1. SFT initialization: fine-tune on 1.26M public demonstration samples to build basic multimodal capabilities; 2. Distribution alignment (the core stage): black-box policy distillation in which a MoE discriminator with perception and reasoning experts provides decoupled correction signals, without requiring teacher-model logits (the black-box property); 3. RLVR optimization: RL training becomes more stable after alignment. For the alignment phase, 113K high-difficulty samples generated by Gemini 3 Flash are selected, featuring dense visual grounding and step-by-step reasoning and targeting the model's weak points.
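The decoupling idea behind the MoE discriminator can be sketched as two experts that score a sample independently. Everything below is a hypothetical toy: the expert heuristics, field names, and weights are invented for illustration, whereas the paper's discriminator is a learned model.

```python
# Toy MoE-discriminator sketch with two decoupled experts: one scoring
# perception (visual grounding) and one scoring reasoning (step validity).
# The scoring heuristics and sample fields are invented for illustration.

def perception_expert(sample):
    # Hypothetical score: fraction of referenced image regions that the
    # model's answer actually grounded.
    return sample["grounded_regions"] / max(sample["referenced_regions"], 1)

def reasoning_expert(sample):
    # Hypothetical score: fraction of reasoning steps judged valid.
    return sample["valid_steps"] / max(sample["total_steps"], 1)

def moe_discriminator(sample):
    # The two signals are returned separately (decoupled), so perception
    # drift and reasoning drift can be corrected independently.
    return {
        "perception": perception_expert(sample),
        "reasoning": reasoning_expert(sample),
    }

scores = moe_discriminator({
    "grounded_regions": 3, "referenced_regions": 4,
    "valid_steps": 5, "total_steps": 5,
})
print(scores)  # {'perception': 0.75, 'reasoning': 1.0}
```

Keeping the two scores separate, rather than collapsing them into one scalar reward, is what lets the alignment phase correct perception and reasoning drift independently.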


Section 04

PRISM's Experimental Verification Results

Experiments on Qwen3-VL show: - Cross-algorithm consistency: RL algorithms including GRPO, DAPO, and GSPO all gain performance; - Scalability across model sizes: the 4B model improves accuracy by +4.4% and the 8B model by +6.0% (vs. the SFT→RLVR baseline); - Generalization: the gains hold across multiple multimodal benchmarks.


Section 05

PRISM's Technical Contributions

PRISM's core contributions: 1. Problem diagnosis: reveals the nature of SFT-induced drift and how perception and reasoning drift differ in multimodal settings; 2. Method innovation: extends policy distillation to the black-box setting, reducing dependence on teacher-model internals; 3. Architecture design: a MoE discriminator decouples perception and reasoning evaluation; 4. Data strategy: focusing on high-difficulty samples improves training efficiency.


Section 06

PRISM's Open Source and Community Impact

The research team has open-sourced the code, data, and model checkpoints (GitHub: https://github.com/XIAO4579/PRISM). The release matters for: - Reproducibility: other researchers can verify the results; - Transferability: the training pipeline can be carried over to other multimodal models and tasks; - Baseline establishment: it provides a reference model for follow-up research.


Section 07

PRISM's Insights for the Industry

PRISM's insights for multimodal AI application development: 1. Training pipeline optimization: the two-stage SFT→RL paradigm can be improved by inserting an alignment phase; 2. Data value: precisely targeted high-difficulty samples are more effective than blindly scaling up data; 3. Practical black-box optimization: no access to the teacher model's internals is required, which makes the method easier to deploy.


Section 08

PRISM's Significance and Future Directions

PRISM is an important step in optimizing multimodal post-training pipelines. It mitigates drift through explicit distribution alignment and significantly improves model performance. Beyond offering a practical training method, it clarifies how the stages of a post-training pipeline interact, pointing toward future work on more advanced training paradigms for key multimodal AI applications.