# EgoCoT-Bench: A New Verifiable Reasoning Benchmark for First-Person View Video Understanding

> This article introduces EgoCoT-Bench, a verifiable benchmark for evaluating how multimodal large language models perform fine-grained action reasoning on first-person view videos. It contains 3172 QA pairs with step-by-step reasoning annotations, and evaluations on it reveal key flaws in the evidence consistency of current models.

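- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-19T09:02:20.000Z
- Last activity: 2026-05-20T03:20:14.198Z
- Popularity: 141.7
- Keywords: multimodal large language models, first-person view, video understanding, chain-of-thought reasoning, verifiable reasoning, spatiotemporal scene graph, fine-grained reasoning, manipulation-centric reasoning
- Page link: https://www.zingnex.cn/en/forum/thread/egocot-bench
- Canonical: https://www.zingnex.cn/forum/thread/egocot-bench
- Markdown source: floors_fallback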

---

## Introduction: EgoCoT-Bench, a New Verifiable Reasoning Benchmark for First-Person Video Understanding

EgoCoT-Bench is a verifiable benchmark for fine-grained action reasoning by multimodal large language models on first-person view videos. It includes 3172 QA pairs with step-by-step reasoning annotations, and it reveals key flaws in the evidence consistency of current models. The benchmark emphasizes whether a model's reasoning process can be verified, not only whether its final answer is correct, which makes it a tool for evaluating how well models actually understand what they see.

## Research Background: Challenges in First-Person Video Understanding and Flaws of Existing Benchmarks

As multimodal large language models have matured, first-person view video understanding has drawn growing attention. However, existing benchmarks do not evaluate the grounds of a model's reasoning at a fine-grained level, and they rarely check whether its explanations align with the spatiotemporal evidence in the video. As a result, a model can give the correct answer while its reasoning does not hold up.

## EgoCoT-Bench Benchmark: Data Scale and Detailed Explanation of Four Task Groups

EgoCoT-Bench contains 351 first-person videos and 3172 verifiable QA pairs, organized into 4 major task groups with 12 subtasks in total. The task groups described here are:
1. Perception and Retrospection: understanding actions that have already occurred, such as retracing the sequence of events;
2. Prediction: inferring future events, which tests causal reasoning;
3. High-level Reasoning: abstract understanding, such as identifying the purpose of an action or detecting anomalies.
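
To make the data scale concrete, the following is a minimal sketch of how one benchmark item might be represented, together with the average number of QA pairs per video implied by the figures above. The class names, field names, and the example item are illustrative assumptions; the source does not describe the actual release format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningStep:
    # Hypothetical schema: one annotated reasoning step and the video span supporting it.
    text: str
    start_s: float  # start of the supporting span, in seconds
    end_s: float    # end of the supporting span, in seconds


@dataclass
class QAItem:
    # Hypothetical schema for a single verifiable QA pair.
    video_id: str
    task_group: str  # e.g. "prediction"
    question: str
    answer: str
    steps: List[ReasoningStep] = field(default_factory=list)


# Counts reported for the benchmark.
NUM_VIDEOS = 351
NUM_QA_PAIRS = 3172
avg_qa_per_video = NUM_QA_PAIRS / NUM_VIDEOS

# An invented example item, for illustration only.
item = QAItem(
    video_id="video_0001",
    task_group="prediction",
    question="What will the camera wearer most likely do next?",
    answer="Put the knife down",
    steps=[ReasoningStep("The tomato has been fully sliced.", 4.0, 9.0)],
)

print(f"Average QA pairs per video: {avg_qa_per_video:.2f}")
```

On average, each video therefore carries roughly nine QA pairs.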

## Data Construction: Spatiotemporal Scene Graph-Guided Generation Framework and Step-by-Step Reasoning Annotations

Data construction follows a framework guided by a Spatiotemporal Scene Graph (STSG):
1. Scene Graph Extraction: extract object and action nodes, along with their spatiotemporal relationships, from the videos;
2. Question Generation: automatically generate candidate questions from the scene graph, each with a clear spatiotemporal basis;
3. Manual Refinement: review the candidates to ensure that answers are correct, that questions are relevant to the first-person perspective, and that they meet the fine-grained quality bar.

In addition, each question comes with explicit step-by-step reasoning annotations, which make it possible to check whether a model's reasoning chain is grounded in evidence.
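
The question-generation step can be illustrated with a small sketch. It assumes a scene graph stored as time-stamped relations and uses a single hand-written template for "what happens next" questions. The `Relation` class, the template, and the sample graph are all hypothetical; the source does not specify the real extraction or generation pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Relation:
    # Hypothetical scene-graph edge: subject -predicate-> object over a time span.
    subject: str
    predicate: str
    obj: str
    start_s: float
    end_s: float


def generate_sequence_questions(relations: List[Relation]) -> List[Dict]:
    """Create one candidate question per pair of consecutive relations.

    Each candidate keeps the time spans of both relations as its evidence,
    so a human reviewer (or a later check) can verify the answer.
    """
    ordered = sorted(relations, key=lambda r: r.start_s)
    candidates = []
    for prev, nxt in zip(ordered, ordered[1:]):
        candidates.append({
            "question": (
                "What does the camera wearer do right after they "
                f"{prev.predicate} the {prev.obj}?"
            ),
            "answer": f"{nxt.predicate} the {nxt.obj}",
            "evidence": [(prev.start_s, prev.end_s), (nxt.start_s, nxt.end_s)],
        })
    return candidates


# Invented sample graph, for illustration only.
graph = [
    Relation("hand", "cut", "tomato", 4.0, 9.0),
    Relation("hand", "pick up", "knife", 2.0, 3.5),
    Relation("hand", "put down", "knife", 9.5, 10.0),
]
candidates = generate_sequence_questions(graph)
for c in candidates:
    print(c["question"], "->", c["answer"])
```

Keeping the evidence spans attached to each candidate is what makes the later manual review and reasoning checks tractable: every answer can be traced back to a specific part of the video.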

## Experimental Findings: Issues of Correct Answers but Unreliable Reasoning in Models

Evaluations of state-of-the-art models reveal two problems:
1. Fine-grained reasoning remains challenging: models struggle to track the details of hand-object interactions and to perceive changes in object state;
2. Evidence inconsistency: models can reach the correct answer while their reasoning is inconsistent with the video evidence, for example by mislocating events in space or time, confusing cause and effect, or ignoring contradictory evidence.
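
The gap between answer correctness and evidence consistency can be sketched with two scores: plain answer accuracy, and a stricter accuracy that also requires the time spans cited by the model to overlap the annotated evidence. The temporal IoU criterion, the 0.5 threshold, and the record fields below are assumptions made for illustration; the source does not state how the benchmark scores reasoning.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start_s, end_s)


def span_iou(a: Span, b: Span) -> float:
    """Temporal intersection-over-union of two spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_grounded(pred: List[Span], gold: List[Span], thresh: float) -> bool:
    """True if every annotated span is matched by some cited span."""
    return all(any(span_iou(p, g) >= thresh for p in pred) for g in gold)


def evaluate(records: List[Dict], thresh: float = 0.5) -> Dict[str, float]:
    """Return answer accuracy and accuracy restricted to grounded answers."""
    if not records:
        raise ValueError("records must not be empty")
    correct = grounded = 0
    for r in records:
        if r["pred_answer"] == r["gold_answer"]:
            correct += 1
            if is_grounded(r["pred_spans"], r["gold_spans"], thresh):
                grounded += 1
    n = len(records)
    return {"answer_acc": correct / n, "grounded_acc": grounded / n}


# Invented model outputs, for illustration only.
records = [
    # Correct answer, cited span overlaps the annotation (IoU 0.8).
    {"pred_answer": "B", "gold_answer": "B",
     "pred_spans": [(4.0, 9.0)], "gold_spans": [(4.0, 8.0)]},
    # Correct answer, but the cited span is in the wrong part of the video.
    {"pred_answer": "A", "gold_answer": "A",
     "pred_spans": [(20.0, 25.0)], "gold_spans": [(2.0, 3.5)]},
    # Wrong answer.
    {"pred_answer": "C", "gold_answer": "D",
     "pred_spans": [(1.0, 2.0)], "gold_spans": [(1.0, 2.0)]},
]
scores = evaluate(records)
print(scores)
```

In this toy example the second record counts toward answer accuracy but not toward the grounded score, which is the kind of discrepancy described in point 2 above.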

## Research Significance: Promoting Verifiable Reasoning and Standardized Evaluation

EgoCoT-Bench is significant in three ways:
1. It promotes research on verifiable reasoning and provides a tool for testing whether models truly understand what they see;
2. It exposes an evaluation blind spot: measuring answer accuracy alone is insufficient, and the reasoning process also needs to be verified;
3. It supports the technical development of first-person view applications such as assistive robots and smart homes.

## Limitations and Future Directions: Expanding Data and Automatic Evaluation Tools

Limitations: the data scale is still small (351 videos), domain coverage is limited (mainly everyday scenarios), and evaluation relies on manual effort.

Future directions: expand the data scale, develop tools that verify reasoning automatically, study cross-domain transfer, and explore real-time reasoning capabilities.

## Conclusion: EgoCoT-Bench Sets a New Standard for First-Person Video Understanding Evaluation

EgoCoT-Bench emphasizes verifiable, action-centric reasoning, exposes the limitations of current models, and points to directions for future research. AI systems can be applied reliably in the real world only when their reasoning is grounded in evidence.
