# D-SAT: A Causal World Model That Teaches AI to Understand 'Why' Instead of Just 'What'

> The D-SAT project builds a Dynamic Scene-Action Transformer in three phases. It fine-tunes Gemma 3 with LoRA to predict how a scene graph changes after an action, so that the model can reason about causal relationships in videos.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-01T17:12:09.000Z
- Last activity: 2026-06-01T17:19:06.771Z
- Popularity: 154.9
- Keywords: causal reasoning, world model, video understanding, Gemma 3, LoRA, scene graph, counterfactual training, large language model, parameter-efficient fine-tuning, vision-language model
- Page link: https://www.zingnex.cn/en/forum/thread/d-sat-ai
- Canonical: https://www.zingnex.cn/forum/thread/d-sat-ai

---

## D-SAT Project Overview: Building AI That Understands Causal Relationships in Videos

### Project Source
- Author/Maintainer: engineer-nithura
- Source Platform: GitHub
- Original Title: D-SAT-Phases-1-3-Data-Pipeline-Causal-Model-Training-Counterfactual-Fine-tuning
- Link: https://github.com/engineer-nithura/D-SAT-Phases-1-3-Data-Pipeline-Causal-Model-Training-Counterfactual-Fine-tuning
- Release/Update Time: 2026-06-01T17:12:09Z

### Core Idea
D-SAT (Dynamic Scene-Action Transformer) aims to teach AI to understand 'why' (causal relationships) instead of just 'what' in videos. It builds a causal world model via three phases, using Gemma 3 and LoRA for scene graph-to-scene graph causal reasoning, plus counterfactual training to enhance causal understanding.

## Project Background & Motivation

Current video understanding models have critical limitations:
- Action recognition models identify the action type (e.g., 'cutting') but ignore who performs it, which objects are involved, and how those objects change.
- Scene graph generators capture static spatial relationships but not how they evolve over time.
- Vision-Language Models (VLMs) generate descriptions but lack explicit causal reasoning, so they cannot reliably answer 'what if' questions.

D-SAT's goal is to learn a state transition function: given the current scene graph Gₜ and an action, predict the next scene graph Gₜ₊₁.
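One way to write this down, assuming the action at step t is denoted $a_t$ and the trainable parameters $\theta$ (both symbols are introduced here for illustration and do not come from the repository):

$$
G_{t+1} = f_\theta(G_t, a_t)
$$

If the next scene graph is serialized as a token sequence $y_1, \dots, y_N$, the cross-entropy objective mentioned in the architecture section would take the usual autoregressive form:

$$
\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta\left(y_i \mid y_{<i}, G_t, a_t\right)
$$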

## Technical Architecture Overview

D-SAT has three core components:
1. **Perception Module (Frozen)**
   - Uses a pre-trained DINOv2 ViT backbone plus a graph generation head to convert video frames into structured JSON scene graphs. This module is not trained.
2. **Causal Transition Model (Trainable)**
   - The core component: a Gemma 3 model fine-tuned with LoRA, a parameter-efficient method. It takes the current scene graph and an action description as input and outputs the predicted next scene graph. It is trained with cross-entropy loss.
3. **Counterfactual Reasoning Layer**
   - Applied after base training: the model is fine-tuned on curated counterfactual examples, with the aim of moving it from pattern matching toward causal understanding.
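To make the input and output of the transition model concrete, here is a minimal sketch of how one training example might be serialized into text. The scene graph schema (`objects`, `relations`), the prompt wording, and the helper name `build_example` are all assumptions made for illustration; the repository may use a different format.

```python
import json


def build_example(graph_t: dict, action: str, graph_t1: dict) -> dict:
    """Turn one (G_t, action, G_t+1) triplet into a prompt/target text pair."""
    # sort_keys and compact separators give a stable serialization, so the
    # same graph always maps to the same string.
    prompt = (
        "Current scene graph:\n"
        + json.dumps(graph_t, sort_keys=True, separators=(",", ":"))
        + "\nAction: " + action
        + "\nNext scene graph:\n"
    )
    target = json.dumps(graph_t1, sort_keys=True, separators=(",", ":"))
    return {"prompt": prompt, "target": target}


# Hypothetical scene graphs before and after the action.
graph_t = {
    "objects": [{"id": "onion", "state": "whole"}, {"id": "knife", "state": "idle"}],
    "relations": [["knife", "next_to", "onion"]],
}
graph_t1 = {
    "objects": [{"id": "onion", "state": "chopped"}, {"id": "knife", "state": "idle"}],
    "relations": [["knife", "next_to", "onion"]],
}

example = build_example(graph_t, "chop the onion", graph_t1)
print(example["prompt"] + example["target"])
```

With this layout, the loss would typically be computed only on the target tokens, so the model is scored on the predicted graph rather than on reproducing the prompt.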

## Phases 1 & 2: Data Pipeline & Model Training

### Phase 1: Automated Causal Dataset Generation
- Source: a subset of the YouCook2 dataset (414 videos, 3,180 captioned clips).
- Steps: load annotations → download video clips (yt-dlp) → extract start and end frames (ffmpeg) → generate Gₜ and Gₜ₊₁ with Gemini 2.0 Flash → filter out inconsistent triplets → write triplets.jsonl.
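The last two steps of the pipeline above (filtering and writing `triplets.jsonl`) could look roughly like the sketch below. The consistency rule and the field names `graph_t`, `action`, and `graph_t1` are assumptions made for illustration; the source does not say which checks the project applies.

```python
import json
import os
import tempfile


def is_consistent(triplet: dict) -> bool:
    """Hypothetical consistency check for one generated triplet."""
    before, after = triplet.get("graph_t"), triplet.get("graph_t1")
    if not isinstance(before, dict) or not isinstance(after, dict):
        return False  # generation failed or returned malformed output
    if not str(triplet.get("action") or "").strip():
        return False  # no action text to condition on
    ids_before = {o["id"] for o in before.get("objects", [])}
    ids_after = {o["id"] for o in after.get("objects", [])}
    # Assumed rule: at least one object persists across the action,
    # and the action changes the graph in some way.
    return bool(ids_before & ids_after) and before != after


def write_jsonl(triplets, path) -> int:
    """Write consistent triplets as one JSON object per line; return the count kept."""
    kept = 0
    with open(path, "w", encoding="utf-8") as f:
        for t in triplets:
            if is_consistent(t):
                f.write(json.dumps(t, ensure_ascii=False) + "\n")
                kept += 1
    return kept


whole = {"objects": [{"id": "onion", "state": "whole"}], "relations": []}
chopped = {"objects": [{"id": "onion", "state": "chopped"}], "relations": []}
candidates = [
    {"graph_t": whole, "action": "chop the onion", "graph_t1": chopped},  # kept
    {"graph_t": whole, "action": "chop the onion", "graph_t1": whole},    # unchanged graph
    {"graph_t": whole, "action": "chop the onion", "graph_t1": None},     # malformed
]

out_path = os.path.join(tempfile.mkdtemp(), "triplets.jsonl")
kept = write_jsonl(candidates, out_path)
print(f"kept {kept} of {len(candidates)} triplets")
```

JSON Lines keeps each triplet on its own line, which makes the file easy to stream and to append to as more clips are processed.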

### Phase 2: Causal Model Training
- Base model: Gemma 3 (2B instruction-tuned variant).
- Training: LoRA fine-tuning with the peft library on an A100 GPU, using cross-entropy loss for sequence prediction.
- Evaluation: Graph Edit Distance (GED) between predicted and reference scene graphs on a holdout set.
- Output: lora_adapter/ (model checkpoints).
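As an illustration of the evaluation metric, the sketch below computes a simplified, unit-cost edit distance between a predicted and a reference scene graph. It assumes every object has a unique `id`, so nodes can be matched by id instead of searching over all possible matchings as general GED does. The schema and the function name `simple_ged` are hypothetical, and this is not necessarily the exact metric the repository computes.

```python
def simple_ged(pred: dict, gold: dict) -> int:
    """Unit-cost edit distance for scene graphs whose nodes have unique ids."""
    pred_nodes = {o["id"]: o.get("state") for o in pred.get("objects", [])}
    gold_nodes = {o["id"]: o.get("state") for o in gold.get("objects", [])}

    # Node insertions and deletions: ids present in only one graph.
    cost = len(pred_nodes.keys() ^ gold_nodes.keys())
    # Node substitutions: same id, different state label.
    cost += sum(
        1 for k in pred_nodes.keys() & gold_nodes.keys()
        if pred_nodes[k] != gold_nodes[k]
    )

    # Edge insertions and deletions: relation triples present in only one graph.
    pred_edges = {tuple(r) for r in pred.get("relations", [])}
    gold_edges = {tuple(r) for r in gold.get("relations", [])}
    cost += len(pred_edges ^ gold_edges)
    return cost


gold = {
    "objects": [{"id": "onion", "state": "chopped"}, {"id": "knife", "state": "idle"}],
    "relations": [["knife", "next_to", "onion"]],
}
pred = {
    "objects": [
        {"id": "onion", "state": "whole"},   # wrong state: 1 substitution
        {"id": "knife", "state": "idle"},
        {"id": "bowl", "state": "empty"},    # spurious object: 1 deletion
    ],
    "relations": [["knife", "next_to", "onion"], ["onion", "in", "bowl"]],  # 1 spurious edge
}

distance = simple_ged(pred, gold)
print(distance)  # 3
```

A distance of 0 means the prediction matches the reference exactly; averaging the distance over the holdout set gives a single score to compare checkpoints.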

## Phase 3: Counterfactual Fine-tuning (Key Differentiator)

This phase trains and tests whether the model's predictions actually depend on the action:
- Load the best checkpoint from Phase 2.
- Fine-tune on curated counterfactual examples (e.g., from the same start scene, 'add salt' and 'add sugar' should yield different next scene graphs).
- Evaluation: track both counterfactual accuracy and the original GED, to confirm that fine-tuning has not degraded performance on the Phase 2 task.
- Output: lora_adapter_cf/ (causality-aware model checkpoints).
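The source does not define counterfactual accuracy, so the sketch below uses one plausible definition: a pair counts as correct only if both branches match their expected graphs and the two predictions differ. The pair format, the `predict` callable, and both toy predictors are hypothetical stand-ins for the fine-tuned model.

```python
from typing import Callable


def counterfactual_accuracy(pairs: list, predict: Callable[[dict, str], dict]) -> float:
    """Fraction of pairs where both branches are predicted correctly and differ."""
    if not pairs:
        return 0.0
    correct = 0
    for p in pairs:
        out_a = predict(p["graph_t"], p["action_a"])
        out_b = predict(p["graph_t"], p["action_b"])
        if out_a == p["expected_a"] and out_b == p["expected_b"] and out_a != out_b:
            correct += 1
    return correct / len(pairs)


def soup(state: str) -> dict:
    """Build a one-object scene graph (toy schema)."""
    return {"objects": [{"id": "soup", "state": state}], "relations": []}


def toy_predict(graph: dict, action: str) -> dict:
    # Stand-in for a model whose output depends on the action.
    return soup("salty" if "salt" in action else "sweet")


def action_blind_predict(graph: dict, action: str) -> dict:
    # Stand-in for a pattern matcher that ignores the action entirely.
    return soup("salty")


pairs = [{
    "graph_t": soup("plain"),
    "action_a": "add salt",
    "action_b": "add sugar",
    "expected_a": soup("salty"),
    "expected_b": soup("sweet"),
}]

acc_toy = counterfactual_accuracy(pairs, toy_predict)
acc_blind = counterfactual_accuracy(pairs, action_blind_predict)
print(acc_toy, acc_blind)  # 1.0 0.0
```

The action-blind predictor gets one branch right yet scores zero, which is the behavior this kind of metric is meant to expose.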

## Future Plans for D-SAT

Four more phases are planned to complete the end-to-end system:
1. Phase 4: Expand the dataset (full YouCook2 plus other video datasets).
2. Phase 5: Run full training and a comprehensive evaluation on the expanded data.
3. Phase 6: Connect the frozen perception module to the causal model for end-to-end video inference.
4. Phase 7: Build an interactive demo and write the final report.

## Technical Highlights & Significance

### Key Highlights
- Combines LLM reasoning (Gemma 3), parameter-efficient fine-tuning (LoRA), and counterfactual training.
- Aims to shift AI from pattern recognition toward causal understanding.

### Significance
- Addresses a core gap in AI: moving beyond correlation to causation.
- Could pave the way for more reliable, explainable AI systems.
- Raises a broader question: what does it mean for AI to 'understand' the world, in terms of deep principles rather than surface patterns?
