# MemDreamer: Long Video Understanding via Hierarchical Graph Memory and Agent-based Retrieval Mechanism

> MemDreamer decouples perception and reasoning, adopts a hierarchical graph memory architecture and an agent-based retrieval mechanism, transforms long video understanding into an exploration process, and achieves SOTA performance while using only 2% of the context.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-05T17:59:21.000Z
- Last activity: 2026-06-08T03:22:39.441Z
- Popularity: 96.6
- Keywords: long video understanding, vision-language models, hierarchical graph memory, agent-based retrieval, perception-reasoning decoupling
- Page link: https://www.zingnex.cn/en/forum/thread/memdreamer
- Canonical: https://www.zingnex.cn/forum/thread/memdreamer
- Markdown source: floors_fallback

---

## MemDreamer: A Groundbreaking Solution for Long Video Understanding

MemDreamer is an approach to long video understanding built on three ideas: decoupling perception from reasoning, a hierarchical graph memory architecture, and an agent-based retrieval mechanism. Together they turn long video understanding into an agent exploration process. The approach achieves SOTA performance while using only 2% of the context, which addresses the token explosion and attention dilution problems of long video processing.

## Core Challenges in Long Video Understanding

Current Vision-Language Models (VLMs) handle short videos well but run into token explosion and attention dilution on hour-long videos. An hour of footage at 30 fps is about 108,000 frames (more at higher frame rates), so feeding every frame to the model is extremely costly, and the model struggles to focus on the key information. This limits practical applications such as surveillance analysis and documentary understanding.
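To make the scale concrete, here is a rough back-of-the-envelope estimate. The 30 fps frame rate, the 1 fps sampling rate, and the 256 visual tokens per frame are illustrative assumptions, not figures reported for MemDreamer.

```python
def estimate_visual_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> tuple[int, int]:
    """Return (frame_count, visual_token_count) for a video sampled at `fps`."""
    frames = int(duration_s * fps)
    return frames, frames * tokens_per_frame


HOUR_S = 3600
TOKENS_PER_FRAME = 256  # assumed; real VLMs vary widely

# Every frame of a one-hour video at an assumed native rate of 30 fps.
full_frames, full_tokens = estimate_visual_tokens(HOUR_S, 30, TOKENS_PER_FRAME)

# The same video sparsely sampled at 1 frame per second.
sampled_frames, sampled_tokens = estimate_visual_tokens(HOUR_S, 1, TOKENS_PER_FRAME)

print(f"30 fps: {full_frames:,} frames, {full_tokens:,} tokens")
print(f" 1 fps: {sampled_frames:,} frames, {sampled_tokens:,} tokens")
```

Even at 1 fps, the token count under these assumptions is close to a million, which is why full-input processing of hour-long videos scales poorly.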

## Core Methods: Decoupling Perception and Reasoning & Hierarchical Graph Memory

MemDreamer decouples perception from reasoning and recasts long video understanding as incremental agent exploration: it builds memory while watching the video and actively retrieves from that memory during reasoning. The hierarchical graph memory has three layers: a base layer (a spatiotemporal causal graph that captures relationships between events and objects), a middle layer (semantic clusters that group similar events), and a top layer (a global summary that captures the overall theme).
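One way to picture the three layers is as a small data structure. The sketch below is a minimal illustration only: the class names, fields, and methods (`EventNode`, `Cluster`, `HierarchicalMemory`, `add_event`, and so on) are hypothetical and are not taken from the MemDreamer implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EventNode:
    """Base layer: one event with its time span and the objects involved."""
    event_id: int
    start_s: float
    end_s: float
    description: str
    objects: list[str] = field(default_factory=list)


@dataclass
class Cluster:
    """Middle layer: a group of semantically similar events."""
    label: str
    event_ids: list[int] = field(default_factory=list)


@dataclass
class HierarchicalMemory:
    events: dict[int, EventNode] = field(default_factory=dict)
    causal_edges: list[tuple[int, int]] = field(default_factory=list)  # (cause, effect)
    clusters: dict[str, Cluster] = field(default_factory=dict)
    summary: str = ""  # top layer: global summary of the video

    def add_event(self, node: EventNode, cluster_label: str,
                  caused_by: Optional[int] = None) -> None:
        """Incrementally insert an event while the video is being watched."""
        self.events[node.event_id] = node
        if caused_by is not None:
            self.causal_edges.append((caused_by, node.event_id))
        cluster = self.clusters.setdefault(cluster_label, Cluster(cluster_label))
        cluster.event_ids.append(node.event_id)

    def events_in_cluster(self, label: str) -> list[EventNode]:
        """Navigate from the middle layer down to base-layer events."""
        cluster = self.clusters.get(label)
        return [self.events[i] for i in cluster.event_ids] if cluster else []

    def effects_of(self, event_id: int) -> list[EventNode]:
        """Follow causal edges in the base layer."""
        return [self.events[e] for c, e in self.causal_edges if c == event_id]


# Usage with invented example data.
memory = HierarchicalMemory(summary="A cooking tutorial about baking bread.")
memory.add_event(EventNode(1, 30.0, 90.0, "chef kneads dough", ["chef", "dough"]), "preparation")
memory.add_event(EventNode(2, 1500.0, 1520.0, "dough goes into oven", ["dough", "oven"]),
                 "baking", caused_by=1)
print([e.description for e in memory.effects_of(1)])
```

Keeping the causal edges separate from the clusters means a query can move either along time and causality or along semantic similarity, which matches the division of labour between the base and middle layers described above.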

## Agent-based Retrieval Mechanism: Observation-Reasoning-Action Loop

The reasoning phase uses tool-augmented, agent-based retrieval, implemented as a loop: Observation (the question plus the information retrieved so far) → Reasoning (decide what to retrieve next, for example navigating between memory layers or searching for nodes) → Action (execute the retrieval operation, for example jumping to a time point or querying an event). Each iteration narrows the focus toward the key information.
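The loop can be sketched as follows. This is a toy illustration under stated assumptions: the memory contents are invented, the tool names (`read_summary`, `search_events`) are hypothetical, and `decide` is a rule-based stand-in for the reasoning model, which in a real system would be an LLM or VLM choosing the next tool call.

```python
from typing import Callable

# Toy memory: a global summary plus timestamped event descriptions (all invented).
MEMORY = {
    "summary": "A cooking tutorial: preparing and baking bread.",
    "events": [
        (30.0, "the chef kneads the dough"),
        (1500.0, "the chef places the dough in the oven"),
        (3300.0, "the chef slices the baked bread"),
    ],
}


def read_summary(_: str) -> str:
    """Tool: read the top-layer summary."""
    return MEMORY["summary"]


def search_events(keyword: str) -> str:
    """Tool: search event descriptions for a keyword."""
    hits = [f"{t:.0f}s: {d}" for t, d in MEMORY["events"] if keyword in d]
    return "; ".join(hits) or "no match"


TOOLS: dict[str, Callable[[str], str]] = {
    "read_summary": read_summary,
    "search_events": search_events,
}


def decide(question: str, observations: list[str]) -> tuple[str, str]:
    """Reasoning step (rule-based stand-in): return (action, argument)."""
    if not observations:
        return "read_summary", ""
    if len(observations) == 1:
        keyword = question.rstrip("?").split()[-1]  # naive keyword choice
        return "search_events", keyword
    return "answer", observations[-1]


def run_agent(question: str, max_steps: int = 5) -> tuple[str, list[str]]:
    observations: list[str] = []
    trace: list[str] = []
    for _ in range(max_steps):
        action, arg = decide(question, observations)  # Reasoning
        trace.append(action)
        if action == "answer":
            return arg, trace
        observations.append(TOOLS[action](arg))       # Action yields a new observation
    return "no answer within step budget", trace


answer, trace = run_agent("When does the chef use the oven?")
print(trace)
print(answer)
```

The `max_steps` cap is a design choice of this sketch: it bounds how much of the memory the agent can pull into context, which is the property that keeps context usage small.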

## Experimental Evidence: SOTA Performance and Efficiency Breakthroughs

MemDreamer achieved SOTA results on four mainstream benchmarks: accuracy increased by 12.5 percentage points, and the gap with human experts narrowed to 3.7 points. It uses only 2% of the context window (for example, 1.2 minutes of content for an hour-long video). The experiments also found that logical reasoning ability is positively correlated with long video understanding, establishing agent expansion as a new multi-modal paradigm.
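The 1.2-minute figure follows directly from the 2% ratio. The short check below reproduces that arithmetic; the function name `context_budget_minutes` is introduced here purely for illustration.

```python
def context_budget_minutes(video_minutes: float, context_fraction: float) -> float:
    """Minutes of video content that fit in the given fraction of the full context."""
    return video_minutes * context_fraction


budget_min = context_budget_minutes(60, 0.02)  # one-hour video, 2% of the context
print(f"{budget_min:.1f} minutes = {budget_min * 60:.0f} seconds")
```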

## Application Scenarios and Potential Impact

MemDreamer can be applied to: video surveillance (real-time analysis of abnormal events), content creation (extracting key clips from materials), education and training (quickly locating knowledge points), healthcare (analyzing medical imaging records), and scientific research (processing experiment/observation videos).

## Limitations and Future Outlook

Limitations: constructing the hierarchical graph adds computational overhead, and the method currently handles only visual information. Future directions: optimize the graph construction algorithms, explore unsupervised memory learning, extend to more multi-modal scenarios, and improve the agent's decision-making capabilities.

## Conclusion: Technical Value and Prospects

MemDreamer addresses the core issues of long video understanding through decoupling perception and reasoning, hierarchical graph memory, and agent-based retrieval, achieving SOTA with only 2% context. This achievement opens up prospects for the practical application of VLMs and is expected to drive more innovative applications in the future.
