# CausalDriveBench: A Causal Reasoning Evaluation Benchmark and Dataset Construction Framework for Autonomous Driving

> A comprehensive benchmark for evaluating the causal reasoning capabilities of vision-language-action models in autonomous driving scenarios, supporting the nuScenes, OpenScene, and Argoverse V2 datasets, and providing a complete pipeline from raw data to causal scene graphs, question-answer pairs, and counterfactual trajectories.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-01T18:39:29.000Z
- Last activity: 2026-05-01T18:53:29.822Z
- Popularity: 145.8
- Keywords: autonomous driving, causal reasoning, vision-language-action models, nuScenes, OpenScene, Argoverse, benchmark, counterfactual trajectories, causal scene graphs, ECCV2024
- Page URL: https://www.zingnex.cn/en/forum/thread/causaldrivebench
- Canonical: https://www.zingnex.cn/forum/thread/causaldrivebench
- Markdown source: floors_fallback

---

## [Introduction] CausalDriveBench: Project Overview

CausalDriveBench is a causal reasoning evaluation benchmark for vision-language-action (VLA) models in autonomous driving, supporting three mainstream datasets: nuScenes, OpenScene, and Argoverse V2. It provides a complete construction pipeline from raw data to causal scene graphs, question-answer pairs, and counterfactual trajectories, aiming to fill the gap in causal reasoning evaluation in the autonomous driving field.

## Project Background and Research Motivation

The safety of autonomous driving systems depends not only on the accuracy of perception and planning but, more crucially, on understanding the causal relationships between scene elements. Current end-to-end VLA models perform well in routine scenarios but often struggle with complex situations that require deep causal reasoning. CausalDriveBench was created to address this evaluation gap.

## Core Capabilities and Technical Architecture

### Supported Datasets
- nuScenes: 120 scenes, ~4 samples per scene
- OpenScene (NAVSIM): 100 scenes, using quartile sampling
- Argoverse V2: 133 scenes, 5-camera configuration
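The quartile sampling used for OpenScene above can be sketched as picking frames at evenly spaced quarter positions along a scene's timeline. The exact anchor positions (0%, 25%, 50%, 75%) are an assumption for illustration, not confirmed by the project:

```python
def quartile_sample(frames):
    """Pick frames at assumed quartile positions (0%, 25%, 50%, 75%) of a scene.

    `frames` is any time-ordered sequence of frame records; returns a small,
    deduplicated subset spread across the scene.
    """
    n = len(frames)
    if n == 0:
        return []
    fracs = (0.0, 0.25, 0.5, 0.75)  # assumed quartile anchors
    idxs = sorted({min(n - 1, int(f * n)) for f in fracs})
    return [frames[i] for i in idxs]
```

For a 20-frame scene this yields four frames spaced five apart, which matches the "~4 samples per scene" density described for nuScenes.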

### Six-Stage Pipeline
1. **Record Construction**: Convert raw data into a unified structure containing BEV rendering, multi-view images, agent states, etc.
2. **Causal Scene Graph Generation**: Use multi-modal LLMs to generate structured graphs with 5 node types, multiple edge types, and causal states.
3. **Graph Pruning**: Run a reverse BFS to remove interfering nodes that have no causal path to the ego vehicle.
4. **Causal Ladder QA**: Generate three types of questions (active edges, dormant nodes, interfering nodes) based on Pearl's theory.
5. **Counterfactual Trajectory Generation**: Generate counterfactual scenarios such as agent intervention and infrastructure intervention for specific questions.
6. **LLM Ego Vehicle Trajectory Prediction**: Predict the ego vehicle's trajectory from intervention configurations; the nuPlan simulator can optionally be used as an alternative backend.
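Stage 3's reverse-BFS pruning can be sketched as follows. The graph representation (node ids plus directed `(src, dst)` edges pointing along causal influence) is an assumption for illustration:

```python
from collections import deque

def prune_graph(nodes, edges, ego_id):
    """Keep only nodes with a directed causal path to the ego vehicle.

    `edges` is a list of (src, dst) pairs pointing along causal influence;
    we BFS backwards from the ego over reversed edges, so every node kept
    reaches the ego through at least one causal chain.
    """
    # Build reverse adjacency: dst -> [src, ...]
    rev = {}
    for src, dst in edges:
        rev.setdefault(dst, []).append(src)

    reachable = {ego_id}
    queue = deque([ego_id])
    while queue:
        cur = queue.popleft()
        for src in rev.get(cur, []):
            if src not in reachable:
                reachable.add(src)
                queue.append(src)

    kept_nodes = [n for n in nodes if n in reachable]
    kept_edges = [(s, d) for s, d in edges if s in reachable and d in reachable]
    return kept_nodes, kept_edges
```

A node with no edge chain ending at the ego (an "interfering node" in the pipeline's terminology) is dropped along with its edges.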

## Technical Implementation Details

- **Batch API Cost Optimization**: Use Claude Batch API, with a single sample processing cost of ~$0.16-$0.25
- **Dynamic Camera Sorting**: Dynamically construct the IMAGE_ORDER_BLOCK prompt section to handle camera differences across datasets, avoiding the need to maintain multiple prompt sets
- **Visibility Filtering**: Apply multi-ray 3D ray casting to filter occluded vehicles in nuScenes data
- **Image Size Adaptation**: Automatically adjust image size when AV2 camera images exceed Claude's limits
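The image-size adaptation can be sketched as computing a resize target whose longer side fits under a maximum. The 1568 px limit used here is an assumed example value, not a confirmed constant from the project or the Claude API:

```python
# Assumed limit: oversized images are downscaled before being sent, so we
# compute a target size whose long side fits the cap (1568 px is an assumption).
MAX_LONG_SIDE = 1568

def fit_within_limit(width, height, max_long_side=MAX_LONG_SIDE):
    """Return (width, height) scaled so the longer side fits max_long_side."""
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height  # already small enough; no resize needed
    scale = max_long_side / long_side
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The returned dimensions preserve aspect ratio, so a downstream resize (e.g. with Pillow) only needs this one target size.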

## Visualization and Validation Tools

- **Interactive Visualization**: HTML tool based on D3.js, which can render causal graphs, overlay camera images and BEV, display QA cards, and support switching between original and pruned graphs
- **Validation Script**: Graph post-processing script for manual review and correction, generating `{scene_id}_verified.json` as the standard graph

## Research Value and Application Prospects

CausalDriveBench fills the gap in causal reasoning evaluation for autonomous driving and can be used to:
1. Quantitatively compare the causal reasoning capabilities of different VLA models
2. Identify model failure points in causal scenarios
3. Expand training sets with counterfactual trajectories to improve robustness
4. Understand the basis of model decisions through causal graph visualization

This benchmark promotes a paradigm shift in autonomous driving from "pattern recognition" to "causal understanding".
