Third Place in CVPR 2026 CASTLE Challenge: Agent-based Multi-view Long Video Understanding via Hierarchical Knowledge Graph Retrieval

This article introduces the third-place solution of the CVPR 2026 CASTLE Challenge, proposing a training-free agent framework that achieves efficient long-context video understanding on over 600 hours of multi-view video data through video knowledge graphs and hierarchical retrieval mechanisms.

Tags: long video understanding, knowledge graph, agent, multi-view video, zero-shot learning, CVPR
Published 2026-06-01 17:01 | Recent activity 2026-06-02 12:52 | Estimated read 5 min

Section 01

Introduction: Core Overview of the Third-Place Solution for CVPR 2026 CASTLE Challenge

The third-place solution to the CVPR 2026 CASTLE Challenge is a training-free agent framework. It builds a video knowledge graph and queries it through a hierarchical retrieval mechanism, which allows efficient long-context understanding across more than 600 hours of multi-view video. Combining the structured representation of the knowledge graph with an adaptive agent workflow gives the system zero-shot generalization and interpretable reasoning.

Section 02

Challenge Background: Difficulties in Extreme-Scale Multi-view Video Understanding

The CASTLE Challenge targets large-scale, multi-modal, long-context video streams. Its dataset contains 15 camera views (first- and third-person) and over 600 hours of synchronized recordings. Participants must answer complex questions involving visual counting, action localization, multi-view tracking, and temporal reasoning about speakers, all of which require integrating information across time and across views for spatiotemporal reasoning.

Section 03

Core Methods: Video Knowledge Graph and Hierarchical Retrieval Agent

Video Knowledge Graph: Abstracts the video into static entities (fixed objects, permanent persons), dynamic entities (moving objects, temporary persons), temporal and spatial relationships, and cross events, and supports multi-hop reasoning over them.

Hierarchical Retrieval by Agent: Proceeds from coarse to fine: a global index performs rough screening, the local graph is then inspected in detail, and the result is checked by multi-modal verification. The agent adjusts its strategy adaptively along the way.

Training-Free Design: Builds on pre-trained vision-language models, which enables zero-shot generalization, efficient deployment, and strong interpretability.
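The graph and the three retrieval stages can be illustrated with a small Python sketch. This is not the authors' implementation: the names `Fact`, `VideoKG`, `segment_index`, and `hierarchical_retrieve`, the entity and camera labels, and the `verify` callback (a stand-in for the multi-modal verification step) are all hypothetical, and the graph is filled by hand rather than by detection and tracking.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One timed relation between two entities, observed in one camera view."""
    subj: str
    rel: str
    obj: str
    view: str
    t: float  # seconds from the start of the recording


@dataclass
class VideoKG:
    """Toy video knowledge graph (hypothetical structure, for illustration)."""
    facts: list = field(default_factory=list)
    by_entity: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, subj, rel, obj, view, t):
        fact = Fact(subj, rel, obj, view, t)
        self.facts.append(fact)
        self.by_entity[subj].append(fact)
        self.by_entity[obj].append(fact)

    def multi_hop(self, start, goal, max_hops=3):
        """Breadth-first search for a chain of facts linking two entities."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, path = queue.popleft()
            if node == goal:
                return path
            if len(path) >= max_hops:
                continue
            for fact in self.by_entity[node]:
                nxt = fact.obj if fact.subj == node else fact.subj
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [fact]))
        return None


def hierarchical_retrieve(kg, segment_index, entities, verify=lambda fact: True):
    """Coarse-to-fine retrieval: global index, then local graph, then verification."""
    wanted = set(entities)
    # Stage 1 (coarse): keep time segments whose entity set covers the query.
    segments = [span for span, present in segment_index.items() if wanted <= present]
    # Stage 2 (fine): inspect graph facts that touch a query entity in those segments.
    local = [
        f for f in kg.facts
        if {f.subj, f.obj} & wanted and any(lo <= f.t < hi for lo, hi in segments)
    ]
    # Stage 3 (verify): placeholder for a check by a vision-language model.
    return [f for f in local if verify(f)]


kg = VideoKG()
kg.add("Alice", "picks_up", "mug", view="cam03", t=125.0)
kg.add("mug", "placed_on", "kitchen_table", view="cam07", t=410.0)
kg.add("Bob", "sits_at", "kitchen_table", view="cam07", t=415.0)

# Global index: (start, end) in seconds -> entities seen in that segment.
segment_index = {
    (0, 300): {"Alice", "mug"},
    (300, 600): {"mug", "kitchen_table", "Bob"},
}

path = kg.multi_hop("Alice", "Bob")
hits = hierarchical_retrieve(kg, segment_index, ["mug", "kitchen_table"])
print([f.rel for f in path])
print([(f.rel, f.view, f.t) for f in hits])
```

In this toy setup, the multi-hop query links Alice to Bob through the mug and the table across two camera views, while the retrieval call narrows the search to the one segment containing both query entities before any fact is verified. In a real system the coarse stage would more plausibly rank segments by similarity than by exact set containment; the set test is used here only to keep the sketch short.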

Section 04

Experimental Results: Performance and Analysis

The system placed third in the challenge and performed well on cross-view reasoning, long-range temporal dependencies, and complex queries. Its limitations are a bottleneck in fine-grained visual recognition and a knowledge graph whose quality depends on the accuracy of detection and tracking.

Section 05

Domain Insights and Future Directions

Insights: Structured representation (graphs) outperforms embeddings, retrieval-augmented generation (RAG) is effective in the video domain, and agent architecture has significant value.

Limitations: Knowledge graph construction is insufficiently automated, computational resource demands are high, and the limits of generalization remain to be explored.

Future Directions: Automated graph construction, efficiency optimization, and stronger generalization.

Section 06

Conclusion: Value and Outlook of the Solution

This solution offers an effective approach to extreme-scale, multi-view long video understanding; its core advantages are the training-free design and structured reasoning. The open-source code should help the field advance, and we look forward to seeing the approach validated and improved in more scenarios as video understanding moves toward longer contexts and more complex reasoning.