Zing Forum

Reading

MemAgent: A Reinforcement Learning-Based Memory Agent Framework for Ultra-Long Contexts

MemAgent trains memory agents via end-to-end reinforcement learning, handling ultra-long contexts of up to 3.5 million tokens without modifying the model architecture and achieving over 95% accuracy on the 512K RULER test.

long context · memory agent · reinforcement learning · RLVR · agent workflow · context window
Published 2026-05-12 23:41 · Recent activity 2026-05-12 23:48 · Estimated read 6 min
Section 01

MemAgent: Introduction to the Reinforcement Learning-Based Memory Agent Framework for Ultra-Long Contexts

This article introduces the MemAgent framework, which trains memory agents via end-to-end reinforcement learning. It handles ultra-long contexts of up to 3.5 million tokens without modifying the model architecture and achieves over 95% accuracy on the 512K RULER test. By tackling the core computational bottlenecks and information-loss problems of long-context processing, it opens up a new direction for long-text processing.

Section 02

Challenges in Ultra-Long Context Processing

The context window of large language models remains a practical bottleneck. Standard attention scales quadratically with sequence length, so processing million-token inputs is extremely costly even with extension techniques such as positional-encoding extrapolation or sliding-window attention, while simple truncation or chunking easily loses cross-chunk information and degrades task performance.
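The quadratic-versus-linear distinction is easy to quantify: doubling the sequence length quadruples the pairwise attention work, but only doubles a pass that uses a fixed-size working context. A toy cost model (illustrative unit costs, not real FLOP counts):

```python
def full_attention_cost(n_tokens: int) -> int:
    """Pairwise attention: every token attends to every other token, O(n^2)."""
    return n_tokens * n_tokens

def fixed_window_cost(n_tokens: int, window: int = 8192) -> int:
    """Fixed-size working context: total cost grows linearly with length."""
    return n_tokens * window

# Doubling the input from 1M to 2M tokens:
print(full_attention_cost(2_000_000) / full_attention_cost(1_000_000))  # -> 4.0
print(fixed_window_cost(2_000_000) / fixed_window_cost(1_000_000))      # -> 2.0
```

This gap is why a fixed-memory design like MemAgent's keeps total compute linear in document length.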

Section 03

Core Architecture and Innovations of MemAgent

MemAgent trains memory agents via end-to-end reinforcement learning without modifying the underlying model architecture. Key innovations include: linear time complexity (compute grows linearly with text length); Reinforcement Learning with Verifiable Rewards (RLVR), used to optimize the multi-turn, context-independent dialogue workflow; and strong extrapolation (a model trained on 8K contexts extrapolates to 32K, and after RL training the performance loss on 3.5-million-token QA is under 5%). In the multi-turn, context-independent scheme, each turn starts from a fresh context and the agent actively overwrites a bounded memory; an asynchronous agent framework (RayActor parallelism) avoids blocking.
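The multi-turn scheme described above can be sketched as a loop in which each chunk is processed in a fresh context and only a bounded memory string is carried forward. This is a minimal illustration of the idea, not the real MemAgent policy: `update_memory` and `answer_from` stand in for trained model calls, and the toy keyword-matching versions below are placeholders.

```python
def run_memory_agent(document, question, chunk_size, update_memory, answer_from):
    """Process an arbitrarily long document in O(n) total work:
    each turn sees only (question, memory, one chunk), never the full text."""
    memory = ""
    for start in range(0, len(document), chunk_size):
        chunk = document[start:start + chunk_size]
        # Fresh context each turn: the agent overwrites its own bounded memory.
        memory = update_memory(question, memory, chunk)
    return answer_from(question, memory)

# Toy stand-ins for the model calls (assumptions for illustration only):
def toy_update(question, memory, chunk):
    key = question.strip("?").split()[-1]                   # naive keyword pick
    hits = [s for s in chunk.split(".") if key in s]        # keep relevant sentences
    return ". ".join(filter(None, [memory] + hits))[:200]   # memory stays bounded

def toy_answer(question, memory):
    return memory or "unknown"

doc = "Filler text. " * 50 + "The capital of France is Paris. " + "More filler. " * 50
print(run_memory_agent(doc, "What is the capital of France?", 64, toy_update, toy_answer))
```

The key property is that memory size and per-turn context are constant, so total cost scales linearly with document length regardless of how long `doc` grows.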

Section 04

Performance Validation

MemAgent performs strongly on ultra-long-context tasks: the 14B model handles 3.5-million-token QA with almost no loss; the 7B model exceeds 95% accuracy on the 512K RULER test; and extrapolating from an 8K training context to 3.5 million tokens keeps performance degradation within 5%, demonstrating the architecture's effectiveness and the scalability of RL training.

Section 05

Deployment and Training Guide

Quick Deployment: For local use, serve the model with vLLM and then run the demo script:

vllm serve BytedTsinghua-SIA/RL-MemoryAgent-14B --tensor_parallel_size 2
python quickstart.py

Alternatively, configure environment variables to connect to an online model.

Training Framework: General end-to-end RL training with support for multi-step agent workflows. Data is built from HotpotQA by synthesizing long-context multi-hop samples and filtering out samples that do not require the context; supported models include the Qwen2.5-Instruct series (YaRN must be configured to activate long context); both single-node and multi-node Ray cluster training are supported.
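The data-processing idea described above can be sketched in two steps: pad each QA sample's gold supporting documents with shuffled distractors until the context reaches a target length, and drop samples that are answerable without any context. This is a minimal sketch under assumed field names (`supporting_docs`, etc.); the `answers_without_context` predicate is a placeholder for an actual no-context model check.

```python
import random

def synthesize_long_context(sample, distractor_pool, target_chars, rng):
    """Pad gold supporting docs with shuffled distractors until the context
    reaches roughly target_chars (HotpotQA-style multi-hop synthesis)."""
    docs = list(sample["supporting_docs"])
    pool = list(distractor_pool)
    rng.shuffle(pool)
    while sum(len(d) for d in docs) < target_chars and pool:
        docs.append(pool.pop())
    rng.shuffle(docs)  # gold evidence should not sit at a fixed position
    return {"question": sample["question"], "answer": sample["answer"],
            "context": "\n\n".join(docs)}

def filter_context_dependent(samples, answers_without_context):
    """Drop samples the model can already answer with no context at all."""
    return [s for s in samples if not answers_without_context(s["question"], s["answer"])]

rng = random.Random(0)
sample = {"question": "Which city hosts the festival founded by X?",
          "answer": "ExampleCity",
          "supporting_docs": ["X founded the festival.",
                              "The festival takes place in ExampleCity."]}
long_sample = synthesize_long_context(sample, ["distractor text " * 16] * 8, 500, rng)
print(len(long_sample["context"]), "ExampleCity" in long_sample["context"])
```

Shuffling the final document order matters: if the gold evidence always sat at the start or end, the agent could learn a positional shortcut instead of genuine memory management.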

Section 06

Application Scenarios and Significance

MemAgent can be applied to: document understanding (entire books, legal contracts), code analysis (global understanding of large codebases), scientific research (long papers/multi-document reviews), and dialogue systems (long-term memory of conversation history). Its release is a milestone in the field of long text processing, breaking through traditional context limitations.

Section 07

Summary and Community Contributions

MemAgent breaks through context length limitations via memory agent architecture and RL training; its linear complexity and extrapolation capability open up a new direction for long text processing. The project is built on verl, open-sourcing the training framework, evaluation tools, and pre-trained models (7B/14B), providing the community with a complete toolchain. Future plans include exploring multimodal extensions and more application scenarios.