Zing Forum


MemDreamer: Long Video Understanding via Hierarchical Graph Memory and Agent-based Retrieval Mechanism

MemDreamer decouples perception from reasoning. It combines a hierarchical graph memory with an agent-based retrieval mechanism, recasts long video understanding as an exploration process, and reports state-of-the-art (SOTA) results while using only 2% of the context.

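Tags: long video understanding, vision-language models, hierarchical graph memory, agent retrieval, perception-reasoning decoupling
Published 2026-06-06 01:59 | Recent activity 2026-06-08 11:22 | Estimated read 5 min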

Section 01

MemDreamer: A Groundbreaking Solution for Long Video Understanding

MemDreamer is an approach to long video understanding built on three ideas: decoupling perception from reasoning, storing what the model has seen in a hierarchical graph memory, and answering questions through agent-based retrieval over that memory. Long video understanding becomes an agent exploration process rather than a single pass over every frame. The approach reaches SOTA performance while using only 2% of the context, which addresses the token explosion and attention dilution problems of long video processing.

Section 02

Core Challenges in Long Video Understanding

Current vision-language models (VLMs) handle short videos well but run into token explosion and attention dilution on hour-long videos. At 30 fps, one hour of footage is about 108,000 frames, and longer or higher-frame-rate recordings reach into the millions, so feeding every frame to the model is prohibitively expensive. Even when the input fits, the model struggles to focus on the few frames that matter, which limits practical uses such as surveillance analysis and documentary understanding.
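
To make the scale concrete, here is a back-of-the-envelope estimate in Python. The frame rate, the 1 fps sampling rate, and the 256 tokens per frame are illustrative assumptions, not figures from the MemDreamer work; real encoders and samplers vary widely.

```python
def frame_count(duration_s: int, fps: int) -> int:
    """Number of frames in a clip of the given length."""
    return duration_s * fps


def visual_tokens(duration_s: int, sample_fps: int, tokens_per_frame: int) -> int:
    """Visual tokens produced if every sampled frame is encoded."""
    return frame_count(duration_s, sample_fps) * tokens_per_frame


HOUR_S = 3600

# Assumed values, chosen only for illustration.
raw_frames = frame_count(HOUR_S, fps=30)
sampled_tokens = visual_tokens(HOUR_S, sample_fps=1, tokens_per_frame=256)

print(f"raw frames in one hour at 30 fps: {raw_frames:,}")
print(f"tokens at 1 fps x 256 tokens/frame: {sampled_tokens:,}")
```

Even with aggressive subsampling to one frame per second, the assumed encoder would emit close to a million visual tokens for a single hour of video, which is the token explosion the section describes.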

Section 03

Core Methods: Decoupling Perception and Reasoning & Hierarchical Graph Memory

MemDreamer separates perception from reasoning and treats long video understanding as incremental agent exploration: it builds memory while watching the video, then actively retrieves from that memory at reasoning time. The hierarchical graph memory has three layers: a base layer (a spatiotemporal causal graph that captures relationships between events and objects), a middle layer (semantic clusters that group similar events), and a top layer (a global summary that captures the overall theme).
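
The three layers can be pictured as one data structure. The following is a minimal sketch, not MemDreamer's actual implementation: the class names, fields, and the single "causes" relation are all hypothetical, and in practice the event descriptions, cluster labels, and summary would come from a perception model rather than being hand-written.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EventNode:
    """Base-layer node: one event with its time span (hypothetical schema)."""
    event_id: int
    start_s: float
    end_s: float
    description: str


@dataclass
class HierarchicalMemory:
    # Base layer: events plus directed (src, dst, relation) edges.
    events: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)
    # Middle layer: cluster label -> ids of similar events.
    clusters: dict = field(default_factory=dict)
    # Top layer: one global summary of the video.
    summary: str = ""

    def add_event(self, node: EventNode, cluster: str,
                  caused_by: Optional[int] = None) -> None:
        """Insert an event incrementally, as the video is watched."""
        self.events[node.event_id] = node
        self.clusters.setdefault(cluster, []).append(node.event_id)
        if caused_by is not None:
            self.edges.append((caused_by, node.event_id, "causes"))

    def events_in_cluster(self, cluster: str) -> list:
        return [self.events[i] for i in self.clusters.get(cluster, [])]

    def causes_of(self, event_id: int) -> list:
        return [self.events[src] for src, dst, rel in self.edges
                if dst == event_id and rel == "causes"]


memory = HierarchicalMemory()
memory.add_event(EventNode(0, 10.0, 25.0, "chef chops onions"), cluster="prep")
memory.add_event(EventNode(1, 40.0, 70.0, "onions fry in pan"),
                 cluster="frying", caused_by=0)
memory.summary = "A cooking tutorial with prep and frying stages."

print([e.description for e in memory.causes_of(1)])
```

Keeping the layers in one object lets a retriever start from the summary, narrow to a cluster, and only then touch individual events and their causal edges.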

Section 04

Agent-based Retrieval Mechanism: Observation-Reasoning-Action Loop

The reasoning phase uses tool-augmented, agent-based retrieval that runs as a loop. Observation: the agent takes in the question plus whatever has been retrieved so far. Reasoning: it decides what to retrieve next, such as moving between memory layers or searching for specific nodes. Action: it executes the retrieval, such as jumping to a time point or querying an event. Each pass narrows the focus toward the information needed to answer.
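
The loop can be sketched in a few lines of Python. Everything here is an assumption for illustration: the tool names (`read_summary`, `open_cluster`), the toy memory, and especially `toy_policy`, a hard-coded rule that stands in for the reasoning model the real system would call.

```python
# Toy memory: top-layer summary, middle-layer clusters, base-layer events.
MEMORY = {
    "summary": "A cooking tutorial with prep and frying stages.",
    "clusters": {"prep": [0], "frying": [1, 2]},
    "events": {
        0: "00:10 chef chops onions",
        1: "00:40 onions fry in pan",
        2: "01:15 chef adds garlic to pan",
    },
}


def read_summary(_arg: str) -> str:
    return MEMORY["summary"]


def open_cluster(name: str) -> str:
    ids = MEMORY["clusters"].get(name, [])
    return "; ".join(MEMORY["events"][i] for i in ids)


# Hypothetical retrieval tools the agent may call.
TOOLS = {"read_summary": read_summary, "open_cluster": open_cluster}


def toy_policy(question: str, observations: list) -> tuple:
    """Reasoning step (stand-in): choose the next tool call or answer."""
    if not observations:
        return ("read_summary", "")          # start at the top layer
    if len(observations) == 1:
        for name in MEMORY["clusters"]:      # descend to a matching cluster
            if name in question.lower():
                return ("open_cluster", name)
    return ("answer", observations[-1])


def run_agent(question: str, policy, max_steps: int = 5) -> str:
    observations = []                        # Observation: evidence so far
    for _ in range(max_steps):
        action, arg = policy(question, observations)   # Reasoning
        if action == "answer":
            return arg
        observations.append(TOOLS[action](arg))        # Action
    return observations[-1] if observations else ""


result = run_agent("What happens during frying?", toy_policy)
print(result)
```

The `max_steps` cap is a design choice of this sketch: it bounds how much memory the agent can pull into context, which is the property that lets an agent answer from a small fraction of the video.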

Section 05

Experimental Evidence: SOTA Performance and Efficiency Breakthroughs

MemDreamer reports SOTA results on four mainstream benchmarks: accuracy improves by 12.5 percentage points, and the gap to human experts narrows to 3.7 points. It does so with only 2% of the context window (for an hour-long video, about 1.2 minutes of content). The experiments also find that logical reasoning ability is positively correlated with long video understanding, which is taken as support for agent expansion as a new multimodal paradigm.
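
The 1.2-minute figure follows directly from the 2% budget. A small check, assuming the budget is measured as a fraction of video duration (the source does not say whether the 2% is counted in tokens, frames, or time; the function name is hypothetical):

```python
def context_budget_minutes(video_minutes: float, fraction: float) -> float:
    """Amount of video content that fits in the given context fraction."""
    return video_minutes * fraction


budget_min = context_budget_minutes(video_minutes=60, fraction=0.02)
budget_s = round(budget_min * 60)

print(f"{budget_min:.1f} minutes, i.e. about {budget_s} seconds of an hour-long video")
```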

Section 06

Application Scenarios and Potential Impact

MemDreamer could be applied to video surveillance (real-time analysis of abnormal events), content creation (extracting key clips from source material), education and training (quickly locating knowledge points), healthcare (analyzing medical imaging records), and scientific research (processing experiment and observation videos).

Section 07

Limitations and Future Outlook

Limitations: building the hierarchical graph adds computational overhead, and the method currently handles only visual information. Future directions: optimize the graph construction algorithms, explore unsupervised memory learning, extend to multimodal scenarios, and improve the agent's decision-making.

Section 08

Conclusion: Technical Value and Prospects

MemDreamer addresses the core problems of long video understanding by decoupling perception from reasoning, organizing what it sees in a hierarchical graph memory, and retrieving from that memory with an agent, reaching SOTA with only 2% of the context. The result makes VLMs more practical for long video and may open the way to further applications.