Panoramic Research on Multimodal Large Language Models: Latest Advances from the VITA Series to Video-MME-v2

This article comprehensively reviews the latest research advances in the field of Multimodal Large Language Models (MLLM), covering the VITA series of full-modal models, the Video-MME-v2 video understanding benchmark, and technical breakthroughs of mainstream models such as Qwen, InternVL, and MiniCPM. It demonstrates the rapid development trends of this field in directions like unified understanding and generation, long-context processing, and real-time interaction.

Tags: Multimodal Large Language Models (MLLM) · VITA · Video-MME · Qwen · InternVL · MiniCPM · Full-Modal Models · Video Understanding · Open-Source AI
Published 2026-04-09 17:08 · Recent activity 2026-04-09 17:22 · Estimated read 9 min

Section 01

[Introduction] Panoramic Research on Multimodal Large Language Models: Latest Advances in the VITA Series and Video-MME-v2

This article reviews the latest advances in the Multimodal Large Language Model (MLLM) field: the VITA series of full-modal models, the Video-MME-v2 video understanding benchmark, and the technical breakthroughs of mainstream models such as Qwen, InternVL, and MiniCPM, highlighting the field's rapid progress in unified understanding and generation, long-context processing, and real-time interaction. MLLMs are shifting from specialized to general-purpose, from understanding to generation, and from the digital to the physical world. The open-source ecosystem is thriving, and large-scale application is close at hand.


Section 02

[Background] Development and Evaluation Challenges of Multimodal AI

Multimodal Large Language Models (MLLMs) have seen explosive growth: from simple image-text models to full-modal systems that understand vision, audio, and language simultaneously and interact in real time. The resource library maintained by the Multimodal Intelligence Group at Nanjing University (NJU-MiG) summarizes the core advances. Among the survey work:

  • MME-Survey: The first comprehensive review of MLLM evaluation, pointing out challenges such as one-dimensional evaluation, insufficient coverage of real-world scenarios, and a lack of robustness testing;
  • Unified multimodal understanding and generation: Task-specific models are giving way to single architectures that handle both, bringing knowledge sharing and efficiency gains, though modality alignment and generation quality remain open challenges.

Section 03

[Core Advances] VITA Series: Moving Toward Real-Time Interaction and Full-Modal Capabilities

The VITA (Vision, Interaction, Text, Audio) series is a family of open-source full-modal large language models jointly developed by Tencent and Nanjing University, and stands among the strongest open-source MLLMs:

  • VITA-1.5: A NeurIPS 2025 Highlight paper, achieving real-time visual-audio interaction close to GPT-4o, supporting simultaneous viewing, listening, and speaking with significantly reduced response latency;
  • VITA-E: Expanded to concurrent viewing, listening, speaking, and action capabilities, moving toward interaction with the physical world;
  • Long-VITA: Tackles the long-context problem, extending the context window to 1 million tokens while maintaining leading accuracy on short-context tasks;
  • VITA-Audio: Adopts fast interleaved cross-modal token generation to improve the inference efficiency of speech-language models; a toy illustration of interleaved decoding appears after this list.
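As a rough illustration of what interleaved cross-modal token generation can look like, the sketch below alternates between emitting text tokens and audio tokens from a single decoding loop on a fixed schedule, so speech synthesis can begin before the full text reply is finished. Everything here (the stand-in decoder, the 1-text-to-4-audio ratio, the token ranges) is a hypothetical simplification, not the actual VITA-Audio implementation.

    # Toy sketch of interleaved cross-modal decoding. One decoding loop emits text
    # and audio tokens on a fixed schedule; names and the schedule are illustrative.
    import random

    TEXT_PER_STEP = 1    # hypothetical schedule: 1 text token ...
    AUDIO_PER_STEP = 4   # ... followed by 4 audio codec tokens

    def next_token(history, modality):
        """Stand-in for one decoder step restricted to one modality's vocabulary."""
        vocab = range(0, 1000) if modality == "text" else range(1000, 2000)
        return random.choice(list(vocab))

    def generate_interleaved(prompt_tokens, max_steps=8):
        history = list(prompt_tokens)
        stream = []  # (modality, token) pairs in emission order
        for _ in range(max_steps):
            for _ in range(TEXT_PER_STEP):
                tok = next_token(history, "text")
                history.append(tok)
                stream.append(("text", tok))
            for _ in range(AUDIO_PER_STEP):
                tok = next_token(history, "audio")
                history.append(tok)
                stream.append(("audio", tok))
        return stream

    if __name__ == "__main__":
        for modality, tok in generate_interleaved([1, 2, 3], max_steps=2):
            print(modality, tok)

In a real system the audio tokens would be fed to a neural codec decoder to produce waveform chunks, which is what lets speech start playing while the text is still being generated.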

Section 04

[Benchmark Testing] Video-MME-v2 Leads a New Stage in Video Understanding Evaluation

Video-MME-v2 is currently the most comprehensive video understanding benchmark, improving on its predecessor in several ways:

  1. Wider coverage of video durations (from a few seconds to several hours);
  2. More diverse task types (understanding, reasoning, temporal localization, etc.);
  3. More refined difficulty stratification (from basic perception to high-level reasoning);
  4. More varied real scenarios (education, entertainment, sports, news, etc.).

This benchmark provides an authoritative evaluation standard for video understanding research and development and is spurring progress across its sub-fields.
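To make the duration and difficulty stratification concrete, here is a hedged sketch of how an evaluation harness might represent samples and report accuracy per duration bucket. The field names, bucket boundaries, and task labels are assumptions for illustration, not the official Video-MME-v2 schema.

    # Illustrative (not official) representation of stratified video-QA samples
    # and per-duration-bucket accuracy reporting.
    from dataclasses import dataclass
    from collections import defaultdict

    @dataclass
    class VideoQASample:
        video_id: str
        duration_s: float   # video length in seconds
        task_type: str      # e.g. "perception", "reasoning", "temporal_localization"
        difficulty: str     # e.g. "basic", "intermediate", "advanced"
        question: str
        options: list
        answer: str         # ground-truth option label

    def duration_bucket(seconds: float) -> str:
        if seconds < 120:
            return "short (<2 min)"
        if seconds < 1800:
            return "medium (2-30 min)"
        return "long (>30 min)"

    def accuracy_by_bucket(samples, predictions):
        """predictions maps video_id -> predicted option label."""
        correct, total = defaultdict(int), defaultdict(int)
        for s in samples:
            bucket = duration_bucket(s.duration_s)
            total[bucket] += 1
            if predictions.get(s.video_id) == s.answer:
                correct[bucket] += 1
        return {b: correct[b] / total[b] for b in total}

Reporting accuracy separately per bucket (and, analogously, per task type and difficulty tier) is what lets a stratified benchmark expose models that do well on short clips but degrade on hour-long videos.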


Section 05

[Mainstream Models] Technical Breakthroughs of Open-Source Models like Qwen, InternVL, and MiniCPM

Technical breakthroughs of mainstream open-source MLLMs:

  • Qwen series (Alibaba): Qwen3.5-Omni moves toward native full-modal general AI; Qwen3-VL leads in visual capabilities; Qwen2.5 series breaks through in fine-grained understanding and full-modal interaction;
  • InternVL series (Shanghai AI Laboratory): InternVL3.5 advances comprehensively in generality, reasoning, and efficiency; InternVL-U is a unified multimodal model; InternVL3 explores training and testing optimization strategies, serving as a baseline for academia and industry;
  • MiniCPM series (Tsinghua OpenBMB): MiniCPM-o4.5 achieves GPT-4o-level single-image/multi-image/video understanding on mobile phones; MiniCPM-V4.5 optimizes visual tasks, opening up possibilities for mobile AI applications.

Section 06

[Technical Trends] Emerging Directions of MLLMs: Unification, Reasoning, Long Context, and Embodied Intelligence

Emerging research directions for MLLMs:

  1. Unified understanding and generation: Models like Show-o/Show-o2, Emu3.5, MMaDA, and Omni-Diffusion attempt to handle understanding and generation tasks with a single architecture;
  2. Multimodal reasoning enhancement: GLM-4.1V-Thinking (reinforcement learning reasoning), LlamaV-o1 (step-by-step visual reasoning), Skywork R1V2 (hybrid reinforcement learning), QVQ, etc., improve reasoning capabilities;
  3. Long video and long context: Long-VITA (1 million tokens), LongVU (spatiotemporal compression), Eagle2.5 (post-training optimization), TimeMarker (temporal localization); a generic compression sketch follows this list;
  4. Embodied intelligence and robotics: VITA-VLA (action expert distillation), VITA-E (embodied interaction) combine perception and action.
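The long-video direction in item 3 largely comes down to shrinking the number of visual tokens before they reach the language model. The sketch below shows one generic flavor of spatiotemporal compression, uniform frame subsampling plus spatial average pooling; it is a simplification for illustration, not the specific mechanism used by LongVU or Long-VITA.

    # Generic spatiotemporal token compression for long videos (illustrative only):
    # subsample frames in time, then pool each frame's patch grid in space, so the
    # token count fed to the LLM grows slowly with video length.
    import torch
    import torch.nn.functional as F

    def compress_video_tokens(frame_tokens, keep_every=4, target_grid=4):
        """
        frame_tokens: (T, H, W, D) patch embeddings for T frames.
        Returns (T' * target_grid^2, D) tokens after temporal subsampling
        and spatial average pooling.
        """
        T, H, W, D = frame_tokens.shape
        frames = frame_tokens[::keep_every]            # temporal subsampling
        x = frames.permute(0, 3, 1, 2)                 # (T', D, H, W)
        x = F.adaptive_avg_pool2d(x, target_grid)      # (T', D, g, g)
        x = x.permute(0, 2, 3, 1).reshape(-1, D)       # flatten to a token sequence
        return x

    if __name__ == "__main__":
        video = torch.randn(64, 24, 24, 1024)   # 64 frames of 24x24 patches
        tokens = compress_video_tokens(video)
        print(tokens.shape)                     # torch.Size([256, 1024])

In this toy setup, 64 frames of 576 patches each (36,864 tokens) shrink to 256 tokens; real systems use smarter, content-aware selection, but the goal is the same.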

Section 07

[Challenges and Outlook] Existing Problems and Future Development in the MLLM Field

Challenges facing MLLMs:

  1. Modal alignment: Aligning representation spaces across different modalities (a common contrastive formulation is sketched after this list);
  2. Hallucination problem: Outputs inconsistent with the multimodal input;
  3. Efficiency optimization: High cost of processing long videos/high-resolution images;
  4. Evaluation system: Existing benchmarks struggle to fully assess real capabilities;
  5. Safety and alignment: Complex safety alignment in multimodal scenarios.
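For the modal alignment challenge in item 1, the most common training-time remedy is still a contrastive objective that pulls paired embeddings from two modalities together in a shared space, in the style popularized by CLIP. Below is a minimal symmetric InfoNCE sketch; the embeddings are placeholders, and the exact objectives used by the models discussed above differ.

    # Minimal symmetric contrastive (InfoNCE) alignment loss between two modalities.
    # Embeddings are stand-ins; in practice they come from a vision tower and a text tower.
    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
        """image_emb, text_emb: (B, D) embeddings for B matched image-text pairs."""
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
        targets = torch.arange(image_emb.size(0))         # matched pairs on the diagonal
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

    if __name__ == "__main__":
        img, txt = torch.randn(8, 512), torch.randn(8, 512)
        print(contrastive_alignment_loss(img, txt).item())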

Future outlook:

  • Truly unified full modality: Seamless integration of text, image, audio, video, and action;
  • Real-time interaction: Latency close to human dialogue;
  • Edge-cloud collaboration: Intelligent selection of edge or cloud execution;
  • Embodied intelligence: From digital to physical world perception and action.