# Third Place in CVPR 2026 CASTLE Challenge: Agent-based Multi-view Long Video Understanding via Hierarchical Knowledge Graph Retrieval

> This article introduces the third-place solution of the CVPR 2026 CASTLE Challenge, proposing a training-free agent framework that achieves efficient long-context video understanding on over 600 hours of multi-view video data through video knowledge graphs and hierarchical retrieval mechanisms.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-01T09:01:32.000Z
- Last activity: 2026-06-02T04:52:14.004Z
- Popularity: 118.2
- Keywords: long video understanding, knowledge graph, agent, multi-view video, zero-shot learning, CVPR
- Page link: https://www.zingnex.cn/en/forum/thread/cvpr-2026-castle
- Canonical: https://www.zingnex.cn/forum/thread/cvpr-2026-castle

---

## Introduction: Core Overview of the Third-Place Solution for CVPR 2026 CASTLE Challenge

This article presents the third-place solution to the CVPR 2026 CASTLE Challenge: a training-free agent framework for efficient long-context understanding of more than 600 hours of multi-view video, built on a video knowledge graph and a hierarchical retrieval mechanism. It combines the structured representation of a knowledge graph with an adaptive agent workflow, which gives it zero-shot generalization and interpretable reasoning.

## Challenge Background: Difficulties in Extreme-Scale Multi-view Video Understanding

The CASTLE Challenge targets large-scale, multimodal, long-context video streams. Its dataset contains more than 600 hours of synchronized recordings from 15 perspectives, covering both first-person and third-person views. The questions involve visual counting, action localization, multi-view tracking, and temporal reasoning about speakers, and answering them requires integrating information across time and across views for spatiotemporal reasoning.

## Core Methods: Video Knowledge Graph and Hierarchical Retrieval Agent

- **Video Knowledge Graph**: Represents static entities (fixed objects, permanently present people), dynamic entities (moving objects, temporarily present people), temporal and spatial relationships, and cross-view events, which supports multi-hop reasoning.
- **Hierarchical Retrieval by Agent**: Coarse screening over a global index → detailed inspection of a local subgraph → multimodal verification, with the agent adapting its strategy along the way.
- **Training-Free Design**: Built on pre-trained vision-language models, which enables zero-shot generalization, efficient deployment, and strong interpretability.
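
The three retrieval stages above can be illustrated with a small sketch. This is not the authors' implementation: the class names (`Event`, `VideoKG`), the relation labels, the view names, and the `verify` callback are all hypothetical, and the verifier merely stands in for a vision-language model check on raw frames. The sketch assumes events are graph nodes, screens them by entity and time window, expands the survivors by breadth-first multi-hop traversal, and then filters the result.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """A graph node for something observed in one view during a time span."""
    event_id: str
    view: str            # hypothetical camera/view label
    start_s: float       # start time in seconds
    end_s: float         # end time in seconds
    entities: frozenset  # entity names detected in the event
    summary: str = ""    # short caption, e.g. produced by a vision-language model


@dataclass
class VideoKG:
    """Toy video knowledge graph: events are nodes, relations are labeled edges."""
    events: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)  # event_id -> [(relation, event_id)]

    def add_event(self, event):
        self.events[event.event_id] = event
        self.edges.setdefault(event.event_id, [])

    def relate(self, src, relation, dst):
        # Store both directions so traversal can cross views and time either way.
        self.edges[src].append((relation, dst))
        self.edges[dst].append((relation, src))

    def coarse_filter(self, entity, t_min, t_max):
        """Stage 1: global screening by entity name and overlapping time window."""
        return [
            e.event_id for e in self.events.values()
            if entity in e.entities and e.start_s < t_max and e.end_s > t_min
        ]

    def expand(self, seeds, max_hops):
        """Stage 2: local multi-hop expansion (breadth-first) around the seeds."""
        seen = {s: 0 for s in seeds}
        queue = deque(seeds)
        while queue:
            node = queue.popleft()
            if seen[node] == max_hops:
                continue  # do not expand beyond the hop budget
            for _relation, neighbor in self.edges[node]:
                if neighbor not in seen:
                    seen[neighbor] = seen[node] + 1
                    queue.append(neighbor)
        return seen  # event_id -> hop distance from the nearest seed


def retrieve(kg, entity, t_min, t_max, verify, max_hops=2):
    """Stage 3: keep only candidates accepted by the verifier callback."""
    candidates = kg.expand(kg.coarse_filter(entity, t_min, t_max), max_hops)
    ranked = sorted(candidates, key=lambda eid: (candidates[eid], eid))
    return [eid for eid in ranked if verify(kg.events[eid])]


kg = VideoKG()
kg.add_event(Event("e1", "ego_1", 0, 30, frozenset({"alice", "mug"}), "alice picks up a mug"))
kg.add_event(Event("e2", "exo_kitchen", 20, 60, frozenset({"alice", "bob"}), "alice talks to bob"))
kg.add_event(Event("e3", "ego_2", 55, 90, frozenset({"bob"}), "bob leaves the kitchen"))
kg.add_event(Event("e4", "exo_living", 500, 520, frozenset({"carol"}), "carol reads"))
kg.relate("e1", "same_person_other_view", "e2")
kg.relate("e2", "followed_by", "e3")

# Query: starting from events that mention the mug in the first 40 seconds,
# which connected events involve bob?
hits = retrieve(kg, "mug", 0, 40, verify=lambda ev: "bob" in ev.entities)
print(hits)
```

In this toy graph, the query reaches `e2` in one hop across views and `e3` in a second hop across time, while the unrelated `e4` is never visited. A real system would presumably replace the linear scan in `coarse_filter` with a proper index and the lambda verifier with a model call.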

## Experimental Results: Performance and Analysis

The system placed third in the challenge and performed well on questions involving cross-view reasoning, long-range temporal dependencies, and complex queries. Its limitations are a bottleneck in fine-grained visual recognition and the dependence of knowledge graph construction on the accuracy of upstream detection and tracking.

## Domain Insights and Future Directions

- **Insights**: Structured graph representations can outperform plain embeddings for this kind of reasoning, retrieval-augmented generation (RAG) carries over effectively to the video domain, and agent architectures add significant value.
- **Limitations**: Knowledge graph construction is not yet sufficiently automated, computational resource demands are high, and the limits of generalization remain to be explored.
- **Future Directions**: Automated graph construction, efficiency optimization, and stronger generalization.

## Conclusion: Value and Outlook of the Solution

This solution offers an effective path toward extreme-scale, multi-view long video understanding, with its training-free design and structured reasoning as core advantages. The open-source code should help the field advance, and we look forward to seeing the approach validated and improved in more scenarios as video understanding moves toward longer contexts and more complex reasoning.
