Section 01
[Introduction] Video-Zero: Key Ideas Behind a New Video Understanding Method Based on Temporal Evidence Self-Evolution
Video-Zero is an annotation-free framework in which a Questioner and a Solver co-evolve. The Questioner identifies information-rich temporal evidence segments and generates questions whose answers depend on those segments; the Solver learns to answer them and to align its answers with that evidence. The method consistently improves the performance of multiple video vision-language model (Video VLM) backbones across 13 video understanding benchmarks, offering the video understanding field a path away from reliance on manual annotation.
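To make the Questioner/Solver split concrete, here is a minimal toy sketch in Python. It is not the Video-Zero implementation: the `Segment` type, the per-segment informativeness scores, and the use of temporal intersection-over-union (IoU) as the measure of evidence alignment are all assumptions introduced for illustration, since this summary does not specify how segments are scored or how alignment is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A temporal span in seconds (hypothetical helper type)."""
    start: float
    end: float


def pick_evidence(scores: dict[Segment, float]) -> Segment:
    """Questioner stand-in: choose the segment with the highest score.

    The scores are assumed inputs; how informativeness is actually
    estimated is not described in this section.
    """
    return max(scores, key=scores.get)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time spans, in [0, 1].

    Used here as an assumed proxy for how well the Solver's predicted
    span aligns with the Questioner's evidence segment.
    """
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Toy example: three candidate segments with made-up scores.
scores = {
    Segment(0.0, 10.0): 0.2,
    Segment(10.0, 20.0): 0.9,
    Segment(20.0, 30.0): 0.4,
}
evidence = pick_evidence(scores)

# Suppose the Solver points at 12s-22s when answering the question.
alignment = temporal_iou(Segment(12.0, 22.0), evidence)
```

In this toy run the Questioner selects the 10s-20s segment, and a Solver span of 12s-22s overlaps it for 8 of the 12 seconds the two spans cover together, giving an alignment of about 0.67. An alignment signal of this kind needs no human labels, which is the property the annotation-free setup relies on.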