Section 01
MMProLong: A Key Breakthrough in Multimodal Models, Reaching 128K Context with Only 5B Training Tokens
Using Qwen2.5-VL-7B as the base model, the research team ran systematic experiments to uncover what makes long-context vision-language training work, and proposed the MMProLong model. With a training budget of only 5B tokens, the approach extends the 7B-parameter model's context window from 32K to 128K and generalizes to 512K. Key findings include that a balanced mix of training sequence lengths outperforms training at a single target length, and that VQA-format training data works better than OCR transcription.
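The article does not spell out the exact data recipe, but the balanced-length finding is easy to illustrate. The sketch below (in Python; the bucket caps, the `num_tokens` field, and the uniform-over-buckets policy are all assumptions for illustration, not the paper's method) shows one way to sample training examples evenly across length buckets instead of drawing only from a single target length:

```python
# Illustrative sketch of a "balanced length distribution" sampler.
# Bucket edges and field names are hypothetical, chosen only to make
# the idea concrete: every length bucket contributes equally, rather
# than all training sequences being packed to one length such as 128K.
import random
from collections import defaultdict

LENGTH_BUCKETS = [32_768, 65_536, 98_304, 131_072]  # hypothetical caps

def bucket_of(num_tokens: int) -> int:
    """Return the smallest bucket cap that fits this example."""
    for cap in LENGTH_BUCKETS:
        if num_tokens <= cap:
            return cap
    return LENGTH_BUCKETS[-1]

def balanced_examples(examples, seed: int = 0):
    """Yield examples so each length bucket is drawn with equal probability."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[bucket_of(ex["num_tokens"])].append(ex)
    pools = [b for b in buckets.values() if b]
    while pools:
        pool = rng.choice(pools)  # uniform over buckets, not over examples
        yield pool.pop(rng.randrange(len(pool)))
        pools = [b for b in pools if b]
```

The design point this captures is that short buckets are not drowned out by long ones: a corpus dominated by 128K sequences would otherwise make up nearly all draws, which is exactly the single-length regime the paper reports as inferior.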