Section 01
FlashVID: A Training-Free Acceleration Scheme for Video Large Language Models (Introduction)
FlashVID is a training-free acceleration scheme for video large language models (video LLMs). At its core is a tree-structured spatiotemporal token merging strategy that speeds up inference severalfold without retraining, while preserving output quality. The work was accepted as an ICLR 2026 Oral paper, and its code has been open-sourced. Because it requires no training and deploys flexibly, it can be applied to a wide range of pre-trained video LLMs.
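The introduction above does not describe the merging algorithm itself, so the sketch below is purely illustrative: it shows one plausible way a tree-structured spatiotemporal token merge could work, pairing adjacent frames bottom-up in a binary tree and fusing tokens whose cosine similarity crosses a threshold. The function names (`merge_frame_pair`, `tree_merge`), the threshold value, and the averaging rule are all assumptions for the example, not FlashVID's actual method.

```python
# Illustrative sketch of tree-structured spatiotemporal token merging.
# NOT FlashVID's algorithm: the pairing scheme, threshold, and averaging
# rule here are assumptions chosen to make the idea concrete.

import torch
import torch.nn.functional as F


def merge_frame_pair(a: torch.Tensor, b: torch.Tensor,
                     threshold: float = 0.9) -> torch.Tensor:
    """Merge the tokens of two adjacent frames.

    a, b: (N, D) visual tokens. For each token in `b`, find its most
    similar token in `a` (cosine similarity); if the match exceeds
    `threshold`, average the pair instead of keeping both.
    """
    sim = F.normalize(b, dim=-1) @ F.normalize(a, dim=-1).T  # (N_b, N_a)
    best_sim, best_idx = sim.max(dim=-1)                     # best match in `a` per b-token

    merged = a.clone()
    survivors = []
    for j in range(b.shape[0]):
        if best_sim[j] >= threshold:
            i = best_idx[j]
            merged[i] = (merged[i] + b[j]) / 2               # fuse redundant token
        else:
            survivors.append(b[j])                           # distinct token is kept
    if survivors:
        merged = torch.cat([merged, torch.stack(survivors)], dim=0)
    return merged


def tree_merge(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Bottom-up binary tree over frames: (T, N, D) -> one merged token set."""
    level = [tokens[t] for t in range(tokens.shape[0])]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(merge_frame_pair(level[i], level[i + 1], threshold))
        if len(level) % 2 == 1:                              # odd frame carries up a level
            nxt.append(level[-1])
        level = nxt
    return level[0]


if __name__ == "__main__":
    frames = torch.randn(8, 16, 64)  # 8 frames, 16 tokens each, dim 64
    merged = tree_merge(frames, threshold=0.8)
    print(frames.shape[0] * frames.shape[1], "->", merged.shape[0], "tokens")
```

Merging bottom-up over a tree, rather than comparing every frame to every other, keeps the number of pairwise similarity computations roughly linear in the frame count, which is one reason a tree structure suits long videos; how FlashVID actually organizes its tree is described in the paper, not here.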