Section 01
InternVideo3: Multimodal Context Reasoning Empowers Video Agents (Introduction)
This article introduces InternVideo3, developed by Shanghai AI Laboratory/OpenGVLab. It extends open-source multimodal models into visual agents that support long-term video understanding and iterative interaction, using two techniques: Multimodal Context Reasoning (MCR) and Multimodal Multi-Head Latent Attention (M²LA). The model targets core challenges in video understanding, such as long-term dependencies and temporal dynamics. The open-source project is available at https://github.com/OpenGVLab/InternVideo, and the original paper was published on arXiv on 2026-06-10 (http://arxiv.org/abs/2606.12195v1).