MMProLong: A Multimodal Large Model Supporting 128K Context Trained with Only 5B Tokens

Through systematic experiments, the research team uncovered how to train long-context vision-language models, finding that a balanced data-length distribution is more effective than concentrating on a single target length. They propose MMProLong, which extends a 7B-parameter model to a 128K context with only 5B training tokens and generalizes to 512K.

Tags: long-context vision-language models · multimodal · MMProLong · continued pretraining · Qwen2.5-VL · VQA · retrieval capability
Published 2026-05-14 01:52 · Last activity 2026-05-14 10:19 · Estimated read: 6 min

Section 01

[Introduction] MMProLong: 128K Context in a Multimodal Model, Achieved with Only 5B Training Tokens

The research team used Qwen2.5-VL-7B as the base model, probed the training recipe for long-context vision-language models through systematic experiments, and proposed the MMProLong model. With a training budget of only 5B tokens, MMProLong extends the 7B-parameter model's context from 32K to 128K and generalizes to 512K. Key findings: a balanced data distribution is more effective than a single target length, and VQA-format training data outperforms OCR transcription.


Section 02

Background: Long-Context Capability is the Next Battlefield for Multimodal Large Models

As text-only large models push past million-token contexts, large vision-language models (LVLMs) are racing to catch up on long-context capability. Scenarios such as long-document understanding, long-video analysis, and multi-turn tool calling require models to manage large volumes of interleaved visual and textual information, yet research on multimodal long-context training lags behind, especially in systematic guidance on data-ratio design.


Section 03

Core Findings: Key Rules for Long-Context Training

  1. VQA Format Beats OCR Transcription: VQA-format training data significantly outperforms OCR transcription on long-context evaluations, because it is closer to the visual-language interaction patterns of real scenarios;
  2. Balanced Data Distribution is More Effective: data covering sequences of many lengths works better than data concentrated at a single target length; the key is cultivating a generalizable ability to retrieve critical information (see the sketch after this list);
  3. Retrieval is the Core Capability: retrieval-intensive data paired with a modest amount of reasoning data is optimal; reasoning is the icing on the cake;
  4. Pure Long Data Does Not Hurt Short-Context Capability: when trained on pure long-document VQA data, the model's performance on short-context tasks shows almost no decline.
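
As a concrete illustration of finding 2, here is a minimal Python sketch of assembling a length-balanced training mixture. The bucket boundaries, the `num_tokens` field, and the `balanced_mixture` helper are all hypothetical; the article does not disclose MMProLong's actual bucketing scheme.

```python
import random
from collections import defaultdict

# Hypothetical length buckets (in tokens); the article does not give the real ones.
LENGTH_BUCKETS = [(0, 8_000), (8_000, 32_000), (32_000, 64_000), (64_000, 128_000)]

def bucket_of(num_tokens: int) -> int:
    """Return the index of the length bucket a sample falls into."""
    for i, (lo, hi) in enumerate(LENGTH_BUCKETS):
        if lo <= num_tokens < hi:
            return i
    return len(LENGTH_BUCKETS) - 1  # clamp overlong samples into the last bucket

def balanced_mixture(samples: list[dict], per_bucket: int) -> list[dict]:
    """Draw an equal number of VQA samples from every length bucket, instead of
    drawing only sequences near the 128K target length."""
    buckets: dict[int, list[dict]] = defaultdict(list)
    for s in samples:  # each sample is assumed to carry a 'num_tokens' field
        buckets[bucket_of(s["num_tokens"])].append(s)
    mixture: list[dict] = []
    for idx in range(len(LENGTH_BUCKETS)):
        pool = buckets.get(idx, [])
        mixture.extend(random.sample(pool, min(per_bucket, len(pool))))
    random.shuffle(mixture)
    return mixture
```

The point of the uniform draw is that the model sees key information at every position and scale during training, rather than only inside 128K-length documents.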

Section 04

MMProLong Model: Performance Breakthrough with a Small Budget

Based on the core findings, the research team trained the MMProLong model:

  • Base model: Qwen2.5-VL-7B;
  • Training data: 5B tokens of long-document VQA data;
  • Context extension: from 32K to 128K (a hedged sketch of one possible extension mechanism follows this list);
  • Performance improvement: a 7.1% gain on long-document VQA scores;
  • Ultra-long generalization: maintains strong performance at 256K and 512K context lengths without specialized training;
  • Multi-scenario transfer: performs well on tasks such as multimodal needle-in-a-haystack retrieval over web pages and long-video understanding.
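
The article does not say how MMProLong stretches the position encoding from 32K to 128K. One plausible mechanism, since Qwen2.5-family models support it via a `rope_scaling` entry in the model config, is YaRN-style RoPE scaling before continued pretraining; the snippet below is an assumption-laden sketch, not the paper's confirmed recipe.

```python
# ASSUMPTION: YaRN RoPE scaling is one way Qwen2.5-family models extend context;
# the article does not confirm MMProLong uses it. A factor of 4 maps 32K -> 128K.
rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                                # 32_768 * 4 = 131_072 (~128K)
    "original_max_position_embeddings": 32_768,   # the base model's native window
}

# Back-of-envelope check on the 5B-token budget (illustrative average length):
AVG_TOKENS_PER_SAMPLE = 64_000                    # assumed mean long-document length
num_samples = 5_000_000_000 // AVG_TOKENS_PER_SAMPLE
print(f"~{num_samples:,} long-document VQA samples fit in 5B tokens")  # ~78,125
```

Even under these rough assumptions the budget is small: on the order of tens of thousands of long documents, versus the hundreds of billions of tokens typical of pretraining.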

Section 05

Practical Insights: 4 Guidelines for Multimodal Long-Context Training

  1. Prioritize VQA as the Data Format: VQA format is closer to practical applications and trains more efficiently than OCR transcription (a format comparison is sketched after this list);
  2. Balance the Length Distribution: avoid over-concentrating on a single length; ensure the model is fully trained across all lengths and positions;
  3. Make Retrieval the Core: training data should center on retrieval tasks, with reasoning tasks as a supplement;
  4. Long Does Not Sacrifice Short: training on pure long data does not harm short-context capability, which simplifies data preparation.
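
To make guideline 1 concrete, here is a hypothetical sketch of the same long document packaged as OCR-transcription data versus VQA-format data. The field names and chat schema are illustrative assumptions, not the paper's released format.

```python
def ocr_sample(page_images: list[str], transcript: str) -> dict:
    """OCR-style target: reproduce the full text of the document verbatim."""
    return {
        "images": page_images,
        "messages": [
            {"role": "user", "content": "Transcribe all text in the document."},
            {"role": "assistant", "content": transcript},
        ],
    }

def vqa_sample(page_images: list[str], question: str, answer: str) -> dict:
    """VQA-style target: answer a question whose evidence sits somewhere in the
    document, forcing the model to retrieve rather than copy."""
    return {
        "images": page_images,
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
    }
```

The OCR target rewards verbatim copying, while the VQA target forces the model to locate evidence somewhere in a long context, which is exactly the retrieval skill the core findings identify as fundamental.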

Section 06

Future Direction: Long Context Will Become a Standard Feature of Multimodal Models

With the explosion of scenarios such as video, long documents, and multi-turn interaction, long-context capability will become a standard feature of multimodal large models. The MMProLong work not only provides an efficient training recipe but also establishes a framing in which "retrieval capability is fundamental, and length is superficial", pointing follow-up research toward mechanism understanding and capability extension.