Section 01
PRISM: Guide to the Black-box Policy Distillation Pre-alignment Method in Multimodal Reinforcement Learning
PRISM is a three-stage training process designed to address distribution drift in multimodal reinforcement learning. Its core idea is to insert an explicit distribution-alignment stage between SFT (Supervised Fine-Tuning) and RLVR (Reinforcement Learning with Verifiable Rewards), using an MoE (Mixture of Experts) discriminator to provide decoupled correction signals for perception and for reasoning. The method reportedly achieves significant performance gains on the Qwen3-VL model, offering a new paradigm for structuring multimodal training pipelines.
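To make the decoupling idea concrete, here is a minimal, self-contained sketch of how an MoE-style discriminator could emit separate correction signals for a "perception" expert and a "reasoning" expert during the alignment stage. All names (`moe_discriminator`, `alignment_stage`, the expert/gate structure) are illustrative assumptions, not the actual PRISM implementation, which is not specified in this guide.

```python
import math
from typing import Callable, Dict, List

def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_discriminator(
    features: List[float],
    experts: Dict[str, Callable[[List[float]], float]],
    gate: Dict[str, List[float]],
) -> Dict[str, float]:
    """Toy MoE discriminator (hypothetical): a linear gate softmax-weights
    per-skill experts, so perception and reasoning each receive their own
    correction signal rather than one entangled score."""
    names = list(experts)
    # Gate logit per expert: dot product of its gate vector with the features.
    logits = [sum(w * f for w, f in zip(gate[n], features)) for n in names]
    weights = softmax(logits)
    # Decoupled signals: gate weight times that expert's own score.
    return {n: w * experts[n](features) for n, w in zip(names, weights)}

def alignment_stage(
    batch: List[List[float]],
    experts: Dict[str, Callable[[List[float]], float]],
    gate: Dict[str, List[float]],
) -> List[Dict[str, float]]:
    """Stage 2 of the hypothetical pipeline (between SFT and RLVR):
    score each sample and return per-skill correction signals that a
    downstream policy update could use to re-weight its loss."""
    return [moe_discriminator(x, experts, gate) for x in batch]

if __name__ == "__main__":
    # Toy experts: each reads one feature dimension as its skill score.
    experts = {
        "perception": lambda f: f[0],
        "reasoning": lambda f: f[1],
    }
    # Toy gate: each expert attends to its own feature dimension.
    gate = {"perception": [1.0, 0.0], "reasoning": [0.0, 1.0]}
    signals = alignment_stage([[2.0, 1.0]], experts, gate)
    print(signals[0])
```

The point of the sketch is the output shape: one scalar signal per skill, so a perception error and a reasoning error can be corrected independently instead of through a single blended reward.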