Section 01
Introduction: STRIVE, a Stable and Efficient Approach to Reinforcement Learning for Video Question Answering
STRIVE (Structured Spatiotemporal Exploration Reinforcement Learning) targets the low reward variance problem in reinforcement learning (RL) training for video question answering: when the responses sampled for a question receive nearly identical rewards, the resulting advantage estimates carry little learning signal. Its core idea is to construct spatiotemporal variants of each video and to normalize rewards jointly across the sampled text generations and the visual variants, which yields a richer reward signal and more stable advantage estimation. STRIVE consistently outperforms strong baselines on six video reasoning benchmarks, including VideoMME and TempCompass, and addresses the tendency of RL training to converge poorly or settle into local optima.
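To make the effect of joint normalization concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the text above does not give the exact formula, so the mean/standard-deviation standardization, the function names `per_group_advantages` and `joint_advantages`, and the toy reward matrix are all assumptions chosen for illustration. Each row stands for one hypothetical spatiotemporal variant of a video, and each column for one sampled answer.

```python
import numpy as np


def per_group_advantages(rewards, eps=1e-6):
    """Standardize each row (one video variant's sampled answers) on its own."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def joint_advantages(rewards, eps=1e-6):
    """Standardize over all variants and all sampled answers together.

    Assumption: a single shared mean and standard deviation; the actual
    STRIVE normalization may differ in detail.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Toy rewards (made up): rows are hypothetical video variants,
# columns are sampled answers, 1.0 = correct and 0.0 = incorrect.
rewards = np.array([
    [1.0, 1.0, 1.0, 1.0],  # original video: every answer correct
    [1.0, 0.0, 1.0, 0.0],  # variant A: mixed results
    [0.0, 0.0, 0.0, 0.0],  # variant B: every answer wrong
])

per_group = per_group_advantages(rewards)
joint = joint_advantages(rewards)

print("per-group advantages:\n", per_group)
print("joint advantages:\n", joint)
```

In this toy case, the first and third rows have zero within-row variance, so per-row normalization assigns them zero advantage and they contribute no gradient. Under joint normalization the same rows receive positive and negative advantages respectively, because the spread across variants enters the statistics.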