Zing Forum


STRIVE: Structured Spatiotemporal Exploration Makes Reinforcement Learning for Video Question Answering More Stable and Efficient

STRIVE addresses the problem of low reward variance by constructing spatiotemporal variants of videos and performing joint normalization across text generation and visual variants, consistently outperforming strong baselines on 6 video reasoning benchmarks.

Video QA · STRIVE · Reinforcement Learning · Multimodal · Spatiotemporal Exploration · VideoMME · GRPO · Joint Normalization · Importance Sampling
Published 2026-04-02 17:35 · Recent activity 2026-04-03 09:25 · Estimated read 7 min

Section 01

Introduction: STRIVE—A Stable and Efficient New Solution for Reinforcement Learning in Video Question Answering

STRIVE (Structured Spatiotemporal Exploration Reinforcement Learning) targets the low-reward-variance problem in reinforcement learning (RL) training for video question answering. Its core idea is to construct spatiotemporal variants of videos and perform joint normalization across text generations and visual variants, enriching the reward signal and stabilizing advantage estimation. The method consistently outperforms strong baselines on 6 video reasoning benchmarks, including VideoMME and TempCompass, and mitigates RL training's tendency to converge slowly or fall into local optima.


Section 02

Background: Core Dilemma of Reinforcement Learning for Video Question Answering

Video question answering is a core task in multimodal AI, requiring understanding of video content and answering questions. RL provides a training paradigm without token-wise supervision, but faces the unique challenge of excessively low reward variance in video question answering: when multiple answers generated by the model have similar correctness, the reward differences within the group are small, leading to weak or unstable advantage estimation, lack of clear signals for policy updates, and difficulty in training convergence.
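
The instability described above can be seen with a few lines of arithmetic. This is an illustration of the general group-normalization issue (not code from the paper): when all rewards in a group are nearly identical, the group standard deviation collapses and the normalized advantages carry no usable, well-scaled signal.

```python
# Illustration: why near-identical rewards within a group give weak or
# unstable group-normalized advantage estimates.
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Diverse rewards: clear, well-scaled update signal (advantages near +/-1).
print(group_advantages([1.0, 0.0, 1.0, 0.0]))

# Identical rewards: every advantage is exactly zero -> no learning signal,
# and rewards that are merely *near*-identical amplify noise instead.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

This is the degenerate regime STRIVE's cross-modal groups are designed to avoid: widening the comparison group raises the chance that rewards actually differ.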


Section 03

Core Insight of STRIVE: Innovative Idea of Cross-Modal Group Comparison

The core insight of STRIVE lies in cross-modal group comparison: not only comparing different text answers, but also generating spatiotemporal variants of videos (such as key frame selection, time range adjustment, spatial cropping), and combining each variant with text answers to form (video variant, text answer) pairs. Through this multi-dimensional comparison (text diversity, visual diversity, cross-modal interaction), the comparison space is expanded, providing richer reward signals and making advantage estimation more stable and meaningful.


Section 04

Construction of Spatiotemporal Variants: Importance-Aware Structured Exploration

STRIVE constructs spatiotemporal variants through an importance-aware sampling mechanism:

  1. Frame importance evaluation: Identify key frames related to the question through question-frame alignment, temporal attention, and multi-scale analysis;
  2. Variant generation strategies:
    • Temporal variants: High importance sampling, uniform sampling, random perturbation;
    • Spatial variants: Spatial cropping, multi-scale views, attention guidance.

This design ensures that variants are structured, question-related semantic perturbations rather than random noise.
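The temporal strategies above can be sketched as follows. This is a minimal illustration with assumed interfaces, not the paper's implementation: frames are represented by indices, and the per-frame importance scores are assumed to come from an upstream question-frame alignment model.

```python
# Sketch of importance-aware temporal variant generation (illustrative only).
import random

def temporal_variants(num_frames, importance, k=8, seed=0):
    rng = random.Random(seed)
    frames = list(range(num_frames))
    # 1) High-importance sampling: keep the k most question-relevant frames.
    high = sorted(sorted(frames, key=lambda i: importance[i], reverse=True)[:k])
    # 2) Uniform sampling: evenly spaced frames across the whole clip.
    step = max(1, num_frames // k)
    uniform = frames[::step][:k]
    # 3) Random perturbation: jitter the uniform grid by a few frames.
    perturbed = sorted(min(num_frames - 1, max(0, i + rng.randint(-2, 2)))
                       for i in uniform)
    return {"high_importance": high, "uniform": uniform, "perturbed": perturbed}

# Toy scores: frames 16-23 are most relevant to the question.
scores = [0.1] * 16 + [0.9] * 8 + [0.1] * 8
variants = temporal_variants(32, scores, k=8)
print(variants["high_importance"])  # -> [16, 17, 18, 19, 20, 21, 22, 23]
```

Each variant keeps a structured relationship to the question (relevance, coverage, or a small perturbation of coverage), which is what distinguishes this from sampling frames at random.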

Section 05

Joint Normalization: Mathematical Foundation for Stable Advantage Estimation

Mathematical principle of joint normalization: for input video V and question Q, generate K spatiotemporal variants {V₁, ..., V_K} and M text answers {A₁, ..., A_M}, forming K×M combinations, each of which receives a reward R(Vᵢ, Aⱼ). Joint normalization computes the advantage as Adv(Vᵢ, Aⱼ) = (R(Vᵢ, Aⱼ) − μ)/σ, where μ and σ are the mean and standard deviation of the rewards over all K×M combinations. Compared to text-only normalization, the joint version normalizes over a larger sample space, yielding more stable estimates and forcing the model to learn more robust visual understanding.
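
The joint normalization above is straightforward to express in code. The sketch below assumes a precomputed reward grid where `reward_grid[i][j]` holds R(Vᵢ, Aⱼ); the function name is illustrative, not from the paper.

```python
# Joint normalization over the full K x M grid of (variant, answer) rewards.
import statistics

def joint_advantages(reward_grid):
    """reward_grid[i][j] = R(V_i, A_j); returns jointly normalized advantages."""
    flat = [r for row in reward_grid for r in row]
    mu = statistics.fmean(flat)
    sigma = statistics.pstdev(flat) or 1e-6   # guard against zero variance
    return [[(r - mu) / sigma for r in row] for row in reward_grid]

# K=2 video variants x M=3 text answers.
grid = [[1.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
print(joint_advantages(grid))  # -> [[1.0, -1.0, 1.0], [-1.0, -1.0, 1.0]]
```

Note that μ and σ are computed once over all K×M rewards rather than per answer group, which is exactly what enlarges the sample space behind each advantage estimate.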


Section 06

Experimental Validation: Leading Results Across Six Benchmarks

STRIVE was validated on 6 video reasoning benchmarks (VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, PerceptionTest):

  • Results: Average accuracy increased by 3-8 percentage points, training reward curves are smoother, convergence is faster, and generalization ability is stronger;
  • Ablation experiments: Removing spatiotemporal variants/importance-aware sampling/joint normalization all led to significant performance drops, verifying the value of each component.

Section 07

Implications and Outlook: Future Directions for Multimodal RL

Implications: cross-modal comparison can provide richer training signals; structured (rather than random) exploration is key to efficient learning on complex multimodal tasks; and joint normalization suggests that all comparison dimensions should be fully exploited.

Limitations and future work: variant generation adds significant overhead and needs optimization; reliance on external evaluators may propagate their biases; and long-video processing remains challenging. Promising directions include more efficient variant generation and combining the approach with model architecture improvements.