MMProLong: A Multimodal Large Model Supporting 128K Context Trained with Only 5B Tokens

Through systematic experiments, the research team uncovered how to train long-context vision-language models, finding that a balanced data-length distribution is more effective than concentrating on a single target length. They propose MMProLong, which extends a 7B-parameter model to a 128K context with only 5B training tokens and generalizes to 512K.

Tags: long-context vision-language models · multimodal · MMProLong · continued pretraining · Qwen2.5-VL · VQA · retrieval capability
Published 2026-05-14 01:52 · Last activity 2026-05-14 10:19 · Estimated read: 6 min

Section 01

[Introduction] MMProLong: 128K Context in a Multimodal Model, Achieved with Only 5B Training Tokens

The research team used Qwen2.5-VL-7B as the base model, probed the training recipe for long-context vision-language models through systematic experiments, and proposed the MMProLong model. With a training budget of only 5B tokens, MMProLong extends the 7B-parameter model's context from 32K to 128K and generalizes to 512K. Key findings: a balanced data distribution is more effective than a single target length, and VQA-format training data outperforms OCR transcription.


Section 02

Background: Long-Context Capability is the Next Battlefield for Multimodal Large Models

As text-only large models push past million-token contexts, large vision-language models (LVLMs) are racing to catch up on long-context capability. Scenarios such as long-document understanding, long-video analysis, and multi-turn tool calling require models to manage large volumes of interleaved visual and textual information, yet research on multimodal long-context training lags behind, especially in systematic guidance on data-ratio design.


Section 03

Core Findings: Key Rules for Long-Context Training

  1. VQA Format Beats OCR Transcription: VQA-format training data significantly outperforms OCR transcription on long-context evaluations, because it is closer to the visual-language interaction patterns of real scenarios;
  2. Balanced Data Distribution is More Effective: data covering sequences of many lengths works better than data concentrated at a single target length; the key is cultivating a generalizable ability to retrieve critical information (see the sketch after this list);
  3. Retrieval is the Core Capability: retrieval-intensive data paired with a modest amount of reasoning data is optimal; reasoning is the icing on the cake;
  4. Pure Long Data Does Not Hurt Short-Context Capability: when trained on pure long-document VQA data, the model's performance on short-context tasks shows almost no decline.
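
As a concrete illustration of finding 2, here is a minimal Python sketch of assembling a length-balanced training mixture. The bucket boundaries, the `num_tokens` field, and the `balanced_mixture` helper are all hypothetical; the article does not disclose MMProLong's actual bucketing scheme.

```python
import random
from collections import defaultdict

# Hypothetical length buckets (in tokens); the article does not give the real ones.
LENGTH_BUCKETS = [(0, 8_000), (8_000, 32_000), (32_000, 64_000), (64_000, 128_000)]

def bucket_of(num_tokens: int) -> int:
    """Return the index of the length bucket a sample falls into."""
    for i, (lo, hi) in enumerate(LENGTH_BUCKETS):
        if lo <= num_tokens < hi:
            return i
    return len(LENGTH_BUCKETS) - 1  # clamp overlong samples into the last bucket

def balanced_mixture(samples: list[dict], per_bucket: int) -> list[dict]:
    """Draw an equal number of VQA samples from every length bucket, instead of
    drawing only sequences near the 128K target length."""
    buckets: dict[int, list[dict]] = defaultdict(list)
    for s in samples:  # each sample is assumed to carry a 'num_tokens' field
        buckets[bucket_of(s["num_tokens"])].append(s)
    mixture: list[dict] = []
    for idx in range(len(LENGTH_BUCKETS)):
        pool = buckets.get(idx, [])
        mixture.extend(random.sample(pool, min(per_bucket, len(pool))))
    random.shuffle(mixture)
    return mixture
```

The point of the uniform draw is that the model sees key information at every position and scale during training, rather than only inside 128K-length documents.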

Section 04

MMProLong Model: Performance Breakthrough with a Small Budget

Based on the core findings, the research team trained the MMProLong model:

  • Base model: Qwen2.5-VL-7B;
  • Training data: 5B tokens of long-document VQA data;
  • Context extension: from 32K to 128K (a hedged sketch of one possible extension mechanism follows this list);
  • Performance improvement: a 7.1% gain on long-document VQA scores;
  • Ultra-long generalization: maintains strong performance at 256K and 512K context lengths without specialized training;
  • Multi-scenario transfer: performs well on tasks such as multimodal needle-in-a-haystack retrieval over web pages and long-video understanding.
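
The article does not say how MMProLong stretches the position encoding from 32K to 128K. One plausible mechanism, since Qwen2.5-family models support it via a `rope_scaling` entry in the model config, is YaRN-style RoPE scaling before continued pretraining; the snippet below is an assumption-laden sketch, not the paper's confirmed recipe.

```python
# ASSUMPTION: YaRN RoPE scaling is one way Qwen2.5-family models extend context;
# the article does not confirm MMProLong uses it. A factor of 4 maps 32K -> 128K.
rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                                # 32_768 * 4 = 131_072 (~128K)
    "original_max_position_embeddings": 32_768,   # the base model's native window
}

# Back-of-envelope check on the 5B-token budget (illustrative average length):
AVG_TOKENS_PER_SAMPLE = 64_000                    # assumed mean long-document length
num_samples = 5_000_000_000 // AVG_TOKENS_PER_SAMPLE
print(f"~{num_samples:,} long-document VQA samples fit in 5B tokens")  # ~78,125
```

Even under these rough assumptions the budget is small: on the order of tens of thousands of long documents, versus the hundreds of billions of tokens typical of pretraining.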

Section 05

Practical Insights: 4 Guidelines for Multimodal Long-Context Training

  1. Prioritize VQA as the Data Format: VQA format is closer to practical applications and trains more efficiently than OCR transcription (a format comparison is sketched after this list);
  2. Balance the Length Distribution: avoid over-concentrating on a single length; ensure the model is fully trained across all lengths and positions;
  3. Make Retrieval the Core: training data should center on retrieval tasks, with reasoning tasks as a supplement;
  4. Long Does Not Sacrifice Short: training on pure long data does not harm short-context capability, which simplifies data preparation.
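
To make guideline 1 concrete, here is a hypothetical sketch of the same long document packaged as OCR-transcription data versus VQA-format data. The field names and chat schema are illustrative assumptions, not the paper's released format.

```python
def ocr_sample(page_images: list[str], transcript: str) -> dict:
    """OCR-style target: reproduce the full text of the document verbatim."""
    return {
        "images": page_images,
        "messages": [
            {"role": "user", "content": "Transcribe all text in the document."},
            {"role": "assistant", "content": transcript},
        ],
    }

def vqa_sample(page_images: list[str], question: str, answer: str) -> dict:
    """VQA-style target: answer a question whose evidence sits somewhere in the
    document, forcing the model to retrieve rather than copy."""
    return {
        "images": page_images,
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
    }
```

The OCR target rewards verbatim copying, while the VQA target forces the model to locate evidence somewhere in a long context, which is exactly the retrieval skill the core findings identify as fundamental.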

Section 06

Future Direction: Long Context Will Become a Standard Feature of Multimodal Models

With the explosion of scenarios such as video, long documents, and multi-turn interaction, long-context capability will become a standard feature of multimodal large models. The MMProLong work not only provides an efficient training recipe but also establishes a framing in which "retrieval capability is fundamental, and length is superficial", pointing follow-up research toward mechanism understanding and capability extension.