Section 01
PPV-CPT Framework Guide: Cultivating Core Capabilities of Multimodal Agents During Pre-Training
PPV-CPT (Perceive-Predict-Verify Continual Pre-Training) is a framework that introduces a Perceive-Predict-Verify (PPV) loop during the continual pre-training phase, so that vision-language models (VLMs) acquire agentic visual reasoning capabilities before task-specific fine-tuning. This addresses the disconnect between perception and action in traditional training pipelines. The core idea is to build agentic visual reasoning as a foundational capability during pre-training, giving subsequent supervised fine-tuning (SFT) and reinforcement learning (RL) a stronger starting point.