# MMProLong: A Multimodal Large Model Supporting 128K Context Trained with Only 5B Tokens

> The research team revealed the training secrets of long-context vision-language models through systematic experiments, finding that balanced data distribution is more effective than focusing on a single length. They proposed the MMProLong model, which can extend a 7B-parameter model to 128K context with only 5B tokens and generalize to 512K.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-13T17:52:53.000Z
- Last activity: 2026-05-14T02:19:10.482Z
- Popularity: 133.6
- Keywords: long context, vision-language model, multimodal, MMProLong, continual pretraining, Qwen2.5-VL, VQA, retrieval capability
- Page link: https://www.zingnex.cn/en/forum/thread/mmprolong-5b-token128k
- Canonical: https://www.zingnex.cn/forum/thread/mmprolong-5b-token128k

---

## [Main Floor/Introduction] MMProLong: Key Breakthrough in Multimodal Models with 128K Context Achieved Using Only 5B Tokens

The research team used Qwen2.5-VL-7B as the base model, revealed the training secrets of long-context vision-language models through systematic experiments, and proposed the MMProLong model. With a training budget of only 5B tokens, the recipe extends the 7B-parameter model's context from 32K to 128K, and the model further generalizes to 512K. Key findings include that a balanced data distribution is more effective than focusing on a single target length, and that VQA-format training data is superior to OCR transcription.

## Background: Long-Context Capability is the Next Battlefield for Multimodal Large Models

As text-only large models break through to million-token contexts, large vision-language models (LVLMs) are racing to catch up on long-context capability. Scenarios such as long-document understanding, long-video analysis, and multi-turn tool calling require models to manage massive amounts of mixed visual-text information, yet research on multimodal long-context training lags behind, especially on systematic guidance for data-ratio design.

## Core Findings: Key Rules for Long-Context Training

1. **VQA Format Is Superior to OCR Transcription**: VQA-format training data significantly outperforms OCR transcription on long-context evaluations, because it is closer to the visual-language interactions of real scenarios;
2. **Balanced Data Distribution Is More Effective**: Data containing sequences of various lengths works better than data concentrated on a single target length; the key is cultivating key-information retrieval ability that generalizes across lengths (see the length-bucket sketch after this list);
3. **Retrieval Is a Core Capability**: Retrieval-intensive data paired with an appropriate amount of reasoning data is optimal; reasoning is the icing on the cake;
4. **Pure Long Data Does Not Affect Short-Context Capability**: Under training with pure long-document VQA data, the model's performance on short-context tasks shows almost no decline.
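
Finding 2 is essentially a data-mixture decision. Below is a minimal sketch of what a balanced length mixture could look like in practice; the bucket boundaries, the `num_tokens` field, and the per-bucket quota are illustrative assumptions, not values reported by the paper.

```python
import random
from collections import defaultdict

# Hypothetical length buckets (in tokens); the paper's exact bins are not specified.
LENGTH_BUCKETS = [8_192, 16_384, 32_768, 65_536, 131_072]

def bucket_of(num_tokens: int) -> int:
    """Return the smallest bucket capacity that the sample fits into."""
    for cap in LENGTH_BUCKETS:
        if num_tokens <= cap:
            return cap
    return LENGTH_BUCKETS[-1]

def balanced_sample(samples, per_bucket: int, seed: int = 0):
    """Draw roughly the same number of samples from every length bucket,
    instead of concentrating the budget on the single target length (128K)."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for s in samples:
        buckets[bucket_of(s["num_tokens"])].append(s)
    mixture = []
    for cap in LENGTH_BUCKETS:
        pool = buckets.get(cap, [])
        rng.shuffle(pool)
        mixture.extend(pool[:per_bucket])
    rng.shuffle(mixture)
    return mixture
```

Under this kind of mixture, every bucket contributes equally, so the model sees key information placed at short, medium, and maximum distances rather than only at the 128K extreme.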

## MMProLong Model: Performance Breakthrough with a Small Budget

Based on the core findings, the research team trained the MMProLong model:
- Base model: Qwen2.5-VL-7B;
- Training data: 5B tokens of long-document VQA data;
- Context extension: from 32K to 128K (one common extension mechanism is sketched after this list);
- Performance improvement: 7.1% increase in long-document VQA scores;
- Ultra-long generalization: maintains strong performance at 256K and 512K contexts without specialized training;
- Multi-scenario transfer: performs well on tasks such as multimodal needle retrieval on web pages and long-video understanding.
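
The summary does not say how the positional encoding is adjusted for the 32K-to-128K extension. One common mechanism for this kind of extension before continual pretraining is raising the RoPE base frequency; the snippet below is a generic sketch of that idea only, and the head dimension and base values are illustrative assumptions, not MMProLong's confirmed settings.

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies: theta_i = base ** (-2i / head_dim)."""
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (base ** exponents)

# Frequencies for a model originally trained at ~32K context (illustrative base).
short_ctx = rope_inv_freq(head_dim=128, base=1_000_000.0)

# Raising the base lengthens every rotary wavelength, so the angles reached at
# 128K positions stay in a range comparable to what 32K training covered.
long_ctx = rope_inv_freq(head_dim=128, base=8_000_000.0)

print(short_ctx[-1].item(), long_ctx[-1].item())
```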

## Practical Insights: 4 Guidelines for Multimodal Long-Context Training

1. **Prioritize VQA for Data Format**: VQA format is closer to practical applications and has higher training efficiency;
2. **Balance Length Distribution**: Avoid over-focusing on a single length; ensure the model is fully trained across all lengths and positions;
3. **Retrieval Is Core**: Training data should focus on retrieval tasks, with reasoning tasks as a supplement (a sample-construction sketch follows this list);
4. **Long and Short Can Both Be Achieved**: Training with pure long data does not harm short-context capability, which simplifies data preparation.
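
Guidelines 1 to 3 together suggest building samples as long multi-page documents with questions anchored to specific pages. The sketch below shows one way such a sample could be assembled; the field names and the shuffle-then-fill strategy are assumptions for illustration, not the paper's actual data pipeline.

```python
import random

def build_long_vqa_sample(pages, qa_pairs, target_tokens, seed=0):
    """Assemble one long-document VQA training sample (illustrative).

    pages    : list of {"image": <path>, "num_tokens": int} dicts, one per document page.
    qa_pairs : list of {"page_idx": int, "question": str, "answer": str} dicts,
               each answerable from exactly one page (retrieval-style supervision).
    Pages are shuffled so the evidence lands at a random position in the context,
    which trains retrieval rather than reliance on a fixed location.
    """
    rng = random.Random(seed)
    order = list(range(len(pages)))
    rng.shuffle(order)

    context, kept, total = [], set(), 0
    for idx in order:
        if total + pages[idx]["num_tokens"] > target_tokens:
            continue
        context.append(pages[idx])
        kept.add(idx)
        total += pages[idx]["num_tokens"]

    # Keep only questions whose evidence page survived the length budget.
    questions = [qa for qa in qa_pairs if qa["page_idx"] in kept]
    return {"pages": context, "qa": questions, "num_tokens": total}
```

Because each question's answer lives on a single page at an unpredictable position, the supervision signal is dominated by retrieval, with reasoning-style questions mixed in as the supplementary portion.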

## Future Direction: Long Context Will Become a Standard Feature of Multimodal Models

With the explosion of scenarios such as video, long documents, and multi-turn interaction, long-context capability will become a standard feature of multimodal large models. MMProLong not only provides an efficient training recipe but also establishes the framework that "retrieval capability is the foundation, and length is only the surface", pointing subsequent research toward mechanism understanding and capability expansion.
