Section 01
V2PE: A New Approach to Enhance Multimodal Long-Context Understanding
The OpenGVLab team at Shanghai AI Laboratory proposed V2PE (Variable Visual Position Encoding), a method that assigns visual tokens variable, smaller positional increments than text tokens. This substantially improves the ability of vision-language models (VLMs) to handle ultra-long multimodal sequences, extending the supported context length to 1 million tokens. The work has been accepted at ICCV 2025.
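The core idea can be sketched as follows: text tokens advance the position index by 1 as usual, while visual tokens advance it by a smaller stride delta, so long image-token runs consume far less of the positional range. This is a minimal illustrative sketch, not the paper's implementation; the function name and the fixed delta = 1/16 are assumptions (the actual method varies delta rather than fixing it).

```python
def v2pe_positions(is_visual, delta=1.0 / 16):
    """Assign a (possibly fractional) position index to each token.

    is_visual -- sequence of booleans, True for visual tokens.
    delta     -- positional increment for visual tokens (illustrative value;
                 V2PE varies this rather than fixing it).
    Returns a list of float position indices.
    """
    positions = []
    pos = 0.0
    for visual in is_visual:
        positions.append(pos)
        # Visual tokens step by the small stride, text tokens by 1.
        pos += delta if visual else 1.0
    return positions

# One text token, four visual tokens, then a text token:
print(v2pe_positions([False, True, True, True, True, False]))
# → [0.0, 1.0, 1.0625, 1.125, 1.1875, 1.25]
```

Here the four visual tokens occupy only 0.25 units of positional range instead of 4, which is what lets a model with a bounded positional range accommodate far longer multimodal sequences.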