Zing Forum

D-SAT: A Causal World Model That Teaches AI to Understand 'Why' Instead of Just 'What'

The D-SAT project builds a Dynamic Scene-Action Transformer that models causal relationships in videos. Across three phases of work, it fine-tunes Gemma 3 with LoRA to predict how one scene graph turns into the next.

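Tags: Causal reasoning, World model, Video understanding, Gemma 3, LoRA, Scene graph, Counterfactual training, Large language models, Parameter-efficient fine-tuning, Vision-language models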
Published 2026-06-02 01:12 | Recent activity 2026-06-02 01:19 | Estimated read 6 min

Section 01

D-SAT Project Overview: Building AI That Understands Causal Relationships in Videos

Core Idea

D-SAT (Dynamic Scene-Action Transformer) aims to teach AI to understand the 'why' of a video (its causal relationships), not just the 'what'. It builds a causal world model in three phases: Gemma 3 is fine-tuned with LoRA to predict one scene graph from another, and counterfactual training is then used to strengthen causal understanding.


Section 02

Project Background & Motivation

Current video understanding models have critical limitations:

  • Action recognition models identify the action type (e.g., 'cutting') but not who performs it, what it is applied to, or how the scene changes as a result.
  • Scene graph generators capture static spatial relationships but not how they evolve over time.
  • Vision-Language Models (VLMs) generate descriptions but lack explicit causal reasoning, so they cannot answer 'what if' questions.

D-SAT's goal is to learn a state transition function: given the current scene graph Gₜ and an action, predict the next scene graph Gₜ₊₁.
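To make the shape of this transition function concrete, here is a minimal Python sketch. The scene graph layout (an `objects` map plus `relations` triples), the type names, and `toy_transition` are all assumptions for illustration; the project's actual JSON schema is not given here, and the hand-written rule only stands in for the learned model.

```python
from copy import deepcopy
from typing import Callable, Dict, List, TypedDict


class SceneGraph(TypedDict):
    # Hypothetical schema: object name -> attributes, plus relation triples.
    objects: Dict[str, Dict[str, str]]
    relations: List[List[str]]  # [subject, predicate, object]


# The function D-SAT wants to learn: (G_t, action) -> G_{t+1}
TransitionFn = Callable[[SceneGraph, str], SceneGraph]


def toy_transition(graph: SceneGraph, action: str) -> SceneGraph:
    """Hand-written stand-in for the learned model; knows a single action."""
    nxt = deepcopy(graph)  # never mutate the input graph
    if action == "cut the onion" and "onion" in nxt["objects"]:
        nxt["objects"]["onion"]["state"] = "chopped"
    return nxt


g_t: SceneGraph = {
    "objects": {"onion": {"state": "whole"}, "board": {}},
    "relations": [["onion", "on", "board"]],
}
g_t1 = toy_transition(g_t, "cut the onion")
print(g_t1["objects"]["onion"]["state"])
```

The learned model would replace `toy_transition` while keeping the same signature, which is what makes 'what if' queries possible: the same Gₜ can be paired with different actions.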


Section 03

Technical Architecture Overview

D-SAT has three core components:

  1. Perception Module (Frozen)
    • Uses a pre-trained DINOv2 ViT backbone with a graph generation head to convert video frames into structured JSON scene graphs. This module is not trained.
  2. Causal Transition Model (Trainable)
    • The core component: a Gemma 3 model fine-tuned with LoRA, a parameter-efficient method. It takes the current scene graph and the action text as input and outputs the predicted next scene graph. It is trained with cross-entropy loss.
  3. Counterfactual Reasoning Layer
    • Applied after the basic training: the model is fine-tuned on curated counterfactual examples, with the aim of moving it from pattern matching toward causal understanding.
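Because the causal transition model is a language model, the scene graph and action have to be turned into text on the way in and parsed back into a graph on the way out. The sketch below shows one way that could look. The prompt wording, `build_prompt`, and `parse_prediction` are hypothetical; the project's real prompt format is not described here.

```python
import json

# Hypothetical prompt layout; the project's actual template is not given.
PROMPT_TEMPLATE = (
    "Current scene graph:\n{graph}\n"
    "Action: {action}\n"
    "Next scene graph:\n"
)


def build_prompt(graph, action):
    """Serialize (scene graph, action) into the text the model would see."""
    return PROMPT_TEMPLATE.format(
        graph=json.dumps(graph, sort_keys=True), action=action
    )


def parse_prediction(text):
    """Extract the first JSON object from generated text, or None if absent."""
    start = text.find("{")
    if start == -1:
        return None
    try:
        obj, _ = json.JSONDecoder().raw_decode(text, start)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


prompt = build_prompt(
    {"objects": {"onion": {"state": "whole"}}, "relations": []},
    "cut the onion",
)
print(prompt)
```

Returning `None` for unparseable output lets an evaluation loop count malformed generations separately from wrong but well-formed graphs.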

Section 04

Phases 1 & 2: Data Pipeline & Model Training

Phase 1: Automated Causal Dataset Generation

  • Source: YouCook2 dataset (414 videos, 3180 annotated clips).
  • Steps: load annotations → download video clips (yt-dlp) → extract start and end frames (ffmpeg) → generate Gₜ and Gₜ₊₁ with Gemini 2.0 Flash → filter out inconsistent triplets → write triplets.jsonl. Each triplet pairs a start graph, an action, and an end graph.
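The frame extraction and filtering steps above could be sketched as follows. The consistency rules (non-empty action, non-empty graphs, relations that only mention known objects), the field names `g_t` and `g_t1`, and all function names are assumptions; the project's actual filter criteria are not stated. The ffmpeg command is only built, not executed.

```python
import json


def frame_cmd(video_path, timestamp_s, out_path):
    """Build an ffmpeg command that grabs a single frame at timestamp_s."""
    return ["ffmpeg", "-y", "-ss", str(timestamp_s), "-i", video_path,
            "-frames:v", "1", out_path]


def graph_ok(graph):
    """Assumed rule: at least one object, and relations use known objects."""
    objects = graph.get("objects", {})
    if not objects:
        return False
    return all(s in objects and o in objects
               for s, _, o in graph.get("relations", []))


def is_consistent(triplet):
    """Assumed rule: a non-empty action and two well-formed graphs."""
    return (bool(triplet.get("action", "").strip())
            and graph_ok(triplet.get("g_t", {}))
            and graph_ok(triplet.get("g_t1", {})))


def to_jsonl(triplets):
    """Keep consistent triplets only, one JSON object per line."""
    return "".join(json.dumps(t, ensure_ascii=False) + "\n"
                   for t in triplets if is_consistent(t))


good = {
    "action": "cut the onion",
    "g_t": {"objects": {"onion": {"state": "whole"}, "board": {}},
            "relations": [["onion", "on", "board"]]},
    "g_t1": {"objects": {"onion": {"state": "chopped"}, "board": {}},
             "relations": [["onion", "on", "board"]]},
}
# The end graph mentions a bowl that is not among its objects.
bad = dict(good, g_t1={"objects": {"onion": {}},
                       "relations": [["onion", "in", "bowl"]]})
jsonl_text = to_jsonl([good, bad])
```

Writing `jsonl_text` to `triplets.jsonl` would then complete the last step of the pipeline.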

Phase 2: Causal Model Training

  • Base model: Gemma 3 (2B instruction-tuned version).
  • Training: LoRA fine-tuning with the peft library on an A100 GPU, using cross-entropy loss for sequence prediction.
  • Evaluation: Graph Edit Distance (GED) on holdout set.
  • Output: lora_adapter/ (model checkpoints).
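As a rough illustration of the evaluation metric, the sketch below computes a simplified edit distance between a predicted and a reference scene graph. It assumes objects can be matched by name, which avoids the node-matching search that exact Graph Edit Distance requires, so it is a cheaper approximation and not necessarily what the project uses. The graph schema and `simple_ged` are hypothetical.

```python
def simple_ged(pred, ref):
    """Count edits needed to turn pred into ref, matching objects by name.

    Unit costs (an assumption): 1 per inserted/deleted object,
    1 per differing attribute on a shared object,
    1 per inserted/deleted relation triple.
    """
    names_p, names_r = set(pred["objects"]), set(ref["objects"])
    cost = len(names_p ^ names_r)  # object insertions and deletions

    for name in names_p & names_r:
        attrs_p, attrs_r = pred["objects"][name], ref["objects"][name]
        for key in set(attrs_p) | set(attrs_r):
            if attrs_p.get(key) != attrs_r.get(key):
                cost += 1  # attribute substitution, insertion, or deletion

    rel_p = {tuple(r) for r in pred["relations"]}
    rel_r = {tuple(r) for r in ref["relations"]}
    cost += len(rel_p ^ rel_r)  # relation insertions and deletions
    return cost


predicted = {
    "objects": {"onion": {"state": "whole"}, "knife": {}},
    "relations": [["onion", "on", "board"]],
}
reference = {
    "objects": {"onion": {"state": "chopped"}, "knife": {}, "bowl": {}},
    "relations": [["onion", "in", "bowl"]],
}
# 1 missing object + 1 wrong attribute + 2 relation edits
print(simple_ged(predicted, reference))  # 4
```

A distance of 0 means the prediction matches the reference exactly, so a lower average over the holdout set is better.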

Section 05

Phase 3: Counterfactual Fine-tuning (Key Differentiator)

This phase is meant to push the model toward genuine causal understanding and to check whether it gets there:

  • Load the best checkpoint from Phase 2.
  • Fine-tune on curated counterfactual examples (e.g., the same start scene with 'add salt' vs 'add sugar' should yield different results).
  • Evaluation: track both counterfactual accuracy and the original GED, so that gains on counterfactuals do not come at the cost of base performance.
  • Output: lora_adapter_cf/ (causal-aware model checkpoints).
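One possible reading of counterfactual accuracy is sketched below: a pair counts as correct only if the model predicts the right outcome for both actions applied to the same start scene. This definition, the pair format, and the two toy predictors are assumptions for illustration, since the exact metric is not spelled out here.

```python
def counterfactual_accuracy(predict, pairs):
    """Fraction of pairs where both outcomes are predicted exactly."""
    if not pairs:
        return 0.0
    correct = 0
    for pair in pairs:
        start = pair["start"]
        ok_a = predict(start, pair["action_a"]) == pair["expected_a"]
        ok_b = predict(start, pair["action_b"]) == pair["expected_b"]
        correct += ok_a and ok_b
    return correct / len(pairs)


def action_aware(start, action):
    """Toy predictor whose output depends on the action."""
    taste = {"add salt": "salty", "add sugar": "sweet"}[action]
    return {"soup": {**start["soup"], "taste": taste}}


def action_blind(start, action):
    """Toy pattern matcher: always predicts the most common outcome."""
    return {"soup": {**start["soup"], "taste": "salty"}}


pairs = [{
    "start": {"soup": {"taste": "plain"}},
    "action_a": "add salt",
    "expected_a": {"soup": {"taste": "salty"}},
    "action_b": "add sugar",
    "expected_b": {"soup": {"taste": "sweet"}},
}]
print(counterfactual_accuracy(action_aware, pairs))  # 1.0
print(counterfactual_accuracy(action_blind, pairs))  # 0.0
```

Scoring pairs jointly is what separates the two predictors: the action-blind one gets the 'add salt' half right yet still scores zero, because it cannot tell the two actions apart.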

Section 06

Future Plans for D-SAT

Four more phases to complete the end-to-end system:

  1. Phase 4: Expand the dataset (full YouCook2 plus other video datasets).
  2. Phase 5: Full training and comprehensive evaluation on the expanded data.
  3. Phase 6: Connect the frozen perception module to the causal model for end-to-end video inference.
  4. Phase 7: Build an interactive demo and write the final report.

Section 07

Technical Highlights & Significance

Key Highlights

  • Combines LLM reasoning (Gemma 3), parameter-efficient fine-tuning (LoRA), and counterfactual training.
  • Aims to shift video understanding from pattern recognition toward causal understanding.

Significance

  • Addresses a core gap in AI: moving beyond correlation to causation.
  • Could pave the way for more reliable, explainable AI systems.
  • Raises a basic question: what does it mean for AI to 'understand' the world, grasping deep principles or matching surface patterns?