Section 01
Introduction: Cambrian-P—Enhancing Spatial Reasoning of Video Multimodal Large Models via Camera Pose
This article introduces Cambrian-P, a method that improves the spatial reasoning of video multimodal large language models (MLLMs) by incorporating camera pose signals. For each video frame, the method adds a learnable camera token and a pose regression head. This yields a 4.5-6.5% improvement on spatial reasoning benchmarks such as VSI-Bench and state-of-the-art results on ScanNet streaming pose estimation.
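The camera-token idea can be sketched in a few lines of plain Python. This is only an illustrative wiring diagram, not the Cambrian-P implementation: the hidden size, the 7-value pose layout (3 translation values plus a unit quaternion in w, x, y, z order), the linear head, and the `toy_backbone` stand-in for the MLLM are all assumptions made for the example, and the random weights stand in for learned parameters.

```python
import math
import random

HIDDEN = 8    # hypothetical hidden size, chosen only for illustration
POSE_DIM = 7  # assumed pose layout: 3 translation + 4 quaternion (w, x, y, z)


def append_camera_token(frame_tokens, camera_token):
    """Return the frame's visual tokens followed by the camera token."""
    return frame_tokens + [camera_token]


def toy_backbone(sequence):
    """Stand-in for the MLLM (assumption): the hidden state at the camera
    position is the camera token plus the mean of the frame's visual tokens."""
    visual, camera = sequence[:-1], sequence[-1]
    return [c + sum(tok[i] for tok in visual) / len(visual)
            for i, c in enumerate(camera)]


def pose_head(hidden_state, weights, bias):
    """Linear regression head: hidden state -> (translation, unit quaternion)."""
    raw = [sum(w * h for w, h in zip(row, hidden_state)) + b
           for row, b in zip(weights, bias)]
    translation, quat = raw[:3], raw[3:]
    norm = math.sqrt(sum(q * q for q in quat)) or 1.0
    return translation, [q / norm for q in quat]


def predict_poses(video, camera_token, weights, bias):
    """Predict one camera pose per frame from the camera-token position."""
    poses = []
    for frame_tokens in video:
        sequence = append_camera_token(frame_tokens, camera_token)
        poses.append(pose_head(toy_backbone(sequence), weights, bias))
    return poses


rng = random.Random(0)
# One camera token vector, reused for every frame (in training it would be learned).
camera_token = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(POSE_DIM)]
bias = [0.0] * POSE_DIM
bias[3] = 1.0  # start near the identity rotation (w = 1)

# Toy video: 3 frames, each with 4 visual tokens of size HIDDEN.
video = [[[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(4)]
         for _ in range(3)]
poses = predict_poses(video, camera_token, weights, bias)
```

Here `poses` holds one `(translation, quaternion)` pair per frame. In a trained system these predictions would be supervised against ground-truth camera poses, so that the pose signal shapes the representations the model uses for spatial reasoning.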