Zing Forum

EgoCoT-Bench: A New Verifiable Reasoning Benchmark for First-Person View Video Understanding

This article introduces EgoCoT-Bench, a verifiable benchmark for evaluating how well multimodal large language models perform fine-grained action reasoning on first-person view videos. It contains 3172 QA pairs with step-by-step reasoning annotations and reveals key flaws in the evidence consistency of current models.

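Tags: multimodal large language models, first-person view, video understanding, chain-of-thought reasoning, verifiable reasoning, spatiotemporal scene graph, fine-grained reasoning, action-centric reasoning
Published 2026-05-19 17:02 | Recent activity 2026-05-20 11:20 | Estimated read 6 min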

Section 01

Introduction: EgoCoT-Bench, a New Verifiable Reasoning Benchmark for First-Person Video Understanding

EgoCoT-Bench is a verifiable benchmark for fine-grained action reasoning over first-person view videos, aimed at multimodal large language models. It includes 3172 QA pairs with step-by-step reasoning annotations, and evaluations on it reveal a key flaw in current models: their stated reasoning is often inconsistent with the video evidence. Because the benchmark makes the reasoning process itself verifiable, it provides a tool for evaluating whether a model truly understands what it sees.

Section 02

Research Background: Challenges in First-Person Video Understanding and Flaws of Existing Benchmarks

As multimodal large language models have advanced, first-person view video understanding has drawn growing attention. Existing benchmarks, however, rarely evaluate the basis of a model's reasoning at a fine-grained level or check whether its explanations align with the spatiotemporal evidence in the video. As a result, a model can give the correct answer while its reasoning does not hold up.

Section 03

EgoCoT-Bench Benchmark: Data Scale and Detailed Explanation of Four Task Groups

EgoCoT-Bench contains 351 first-person videos and 3172 verifiable QA pairs, divided into 4 major task groups (12 subtasks):

  1. Perception and Retrospection: Understand actions that have occurred, such as retracing event sequences;
  2. Prediction: Infer future events to test causal reasoning;
  3. High-level Reasoning: Abstract understanding (e.g., action purposes, anomaly detection).

Together, these groups cover scenarios ranging from perception and retrospection through prediction to high-level reasoning.

Section 04

Data Construction: Spatiotemporal Scene Graph-Guided Generation Framework and Step-by-Step Reasoning Annotations

Data construction uses a Spatiotemporal Scene Graph (STSG)-guided framework:

  1. Scene Graph Extraction: Extract object and action nodes as well as spatiotemporal relationships from videos;
  2. Question Generation: Automatically generate candidate questions with clear spatiotemporal bases based on the scene graph;
  3. Manual Refinement: Review each question to ensure a correct answer, relevance to the first-person perspective, and fine-grained quality.

In addition, each question comes with explicit step-by-step reasoning annotations, which make it possible to check whether a reasoning chain is grounded in evidence.
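To make the idea of scene-graph-guided question generation more concrete, here is a minimal Python sketch. It is not the benchmark's actual pipeline: the `Relation` record, the `generate_order_question` helper, and the temporal-order question template are all hypothetical, chosen only to show how a question, its answer, and its step-by-step annotation can be derived from the same graph edges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical scene-graph edge: subject --predicate--> obj over [start, end] seconds."""
    subject: str
    predicate: str
    obj: str
    start: float
    end: float


def describe(rel: Relation) -> str:
    """Render an edge as a short action phrase, e.g. 'open the drawer'."""
    return f"{rel.predicate} the {rel.obj}"


def generate_order_question(a: Relation, b: Relation) -> dict:
    """Build a temporal-order question whose answer and steps come from the two edges."""
    if a.start == b.start:
        # Ambiguous ordering would not yield a verifiable question.
        raise ValueError("relations start at the same time")
    first, second = sorted((a, b), key=lambda r: r.start)
    return {
        "question": f"Which did the camera wearer do first: {describe(a)} or {describe(b)}?",
        "answer": describe(first),
        # Each step cites a timestamp, so a reviewer can check it against the video.
        "steps": [
            f"'{describe(first)}' starts at {first.start:.1f}s",
            f"'{describe(second)}' starts at {second.start:.1f}s",
            f"{first.start:.1f}s is earlier than {second.start:.1f}s, "
            f"so '{describe(first)}' came first",
        ],
        "evidence": [first, second],
    }


# Toy graph with two made-up edges (not taken from the dataset).
graph = [
    Relation("hand", "pick up", "knife", 2.0, 3.5),
    Relation("hand", "open", "drawer", 0.5, 1.5),
]
item = generate_order_question(graph[0], graph[1])
print(item["question"])
for step in item["steps"]:
    print(" -", step)
```

Because every step points at a specific edge and timestamp, a human reviewer in the refinement stage can confirm or reject the candidate question by checking only those moments in the video.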

Section 05

Experimental Findings: Issues of Correct Answers but Unreliable Reasoning in Models

Evaluations of cutting-edge models reveal:

  1. Fine-grained reasoning remains challenging: Models struggle to track the details of hand-object interactions and to perceive changes in object states;
  2. Evidence inconsistency: Models reach correct answers while citing evidence that does not support them, with errors such as incorrect spatiotemporal localization, causal confusion, and ignoring contradictory evidence.
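The gap between a correct answer and well-grounded reasoning can be illustrated with a small scoring sketch in Python. This is an assumption-laden example, not the benchmark's evaluation protocol: the record fields (`pred_answer`, `gold_answer`, `pred_span`, `gold_span`), the temporal-IoU check, and the 0.5 threshold are all hypothetical stand-ins for whatever evidence check is actually used.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) time spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def score(records, iou_threshold=0.5):
    """Report answer accuracy alongside accuracy that also requires grounded evidence."""
    if not records:
        raise ValueError("no records to score")
    correct = [r for r in records if r["pred_answer"] == r["gold_answer"]]
    # A correct answer counts as grounded only if the cited span overlaps the annotated one.
    grounded = [
        r for r in correct
        if temporal_iou(r["pred_span"], r["gold_span"]) >= iou_threshold
    ]
    n = len(records)
    return {
        "answer_accuracy": len(correct) / n,
        "grounded_accuracy": len(grounded) / n,
    }


# Made-up predictions for illustration only.
records = [
    {"pred_answer": "A", "gold_answer": "A", "pred_span": (2.0, 4.0), "gold_span": (2.0, 4.0)},
    {"pred_answer": "B", "gold_answer": "B", "pred_span": (10.0, 12.0), "gold_span": (2.0, 4.0)},
    {"pred_answer": "C", "gold_answer": "D", "pred_span": (0.0, 1.0), "gold_span": (0.0, 1.0)},
    {"pred_answer": "A", "gold_answer": "A", "pred_span": (0.0, 3.0), "gold_span": (1.0, 3.0)},
]
result = score(records)
print(result)
```

In this toy data, the second record is the case the findings describe: the answer matches, but the cited time span does not overlap the annotated evidence, so it counts toward answer accuracy and not toward grounded accuracy.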

Section 06

Research Significance: Promoting Verifiable Reasoning and Standardized Evaluation

The significance of EgoCoT-Bench:

  1. Promote research on verifiable reasoning and provide a tool to test the true understanding of models;
  2. Reveal evaluation blind spots: Focusing only on answer accuracy is insufficient; the reasoning process needs to be verified;
  3. Facilitate the technical development of first-person view applications (e.g., assistive robots, smart homes).

Section 07

Limitations and Future Directions: Expanding Data and Automatic Evaluation Tools

Limitations: The data scale is limited (351 videos), domain coverage is narrow (mainly daily scenarios), and evaluation relies on manual review.

Future directions: Expand the data scale, develop automatic reasoning verification tools, conduct cross-domain transfer research, and explore real-time reasoning capabilities.

Section 08

Conclusion: EgoCoT-Bench Sets a New Standard for First-Person Video Understanding Evaluation

EgoCoT-Bench emphasizes verifiable action-centric reasoning, reveals the limitations of current models, and points out directions for future research. Only when reasoning is based on evidence can AI systems be reliably applied in the real world.