Section 01
Q-Fold: Introduction to Query-Aware Focus-Context Spatiotemporal Folding for Long Video Understanding
Q-Fold is a training-free input-construction framework for long video understanding. It builds a query-aware, heterogeneous focus-context representation that preserves both high-fidelity visual evidence and broad temporal coverage under a limited visual budget, improving performance by up to 9.1 percentage points on ultra-long video benchmarks. Rather than following the traditional frame-centric paradigm, Q-Fold treats continuous temporal segments as its basic unit, and it can be combined with existing Video-MLLMs at no additional training cost.
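To make the focus-context trade-off concrete, the toy sketch below splits a fixed visual token budget across temporal segments. It is not Q-Fold's actual procedure. The `Segment` type, the `allocate_budget` function, the relevance scores, and the per-segment token costs are all hypothetical and chosen only for illustration. Every segment first receives a cheap context allocation, which keeps temporal coverage broad, and whatever budget remains upgrades the most query-relevant segments to a costlier focus allocation, which keeps fidelity high where it matters.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    relevance: float  # hypothetical query-relevance score (higher = more relevant)


def allocate_budget(segments, budget, focus_cost=64, context_cost=8):
    """Label each segment 'focus' or 'context' without exceeding the budget.

    The token costs are illustrative assumptions, not values from Q-Fold.
    Returns the list of roles (in input order) and the tokens actually used.
    """
    # Step 1: give every segment the cheap context cost for temporal coverage.
    base = context_cost * len(segments)
    if base > budget:
        raise ValueError("budget too small to cover every segment as context")

    roles = ["context"] * len(segments)
    remaining = budget - base
    upgrade = focus_cost - context_cost  # extra tokens needed to promote a segment

    # Step 2: promote segments to focus in order of decreasing relevance
    # for as long as the remaining budget can pay for the upgrade.
    order = sorted(range(len(segments)),
                   key=lambda i: segments[i].relevance,
                   reverse=True)
    for i in order:
        if remaining < upgrade:
            break
        roles[i] = "focus"
        remaining -= upgrade

    return roles, budget - remaining
```

With four segments and a budget of 150 tokens, for example, this sketch would promote only the two most relevant segments and leave the other two as low-cost context, so the whole timeline stays represented.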