D-SAT: A Causal World Model That Teaches AI to Understand 'Why' Instead of Just 'What'

The D-SAT project builds a dynamic scene-action transformer capable of understanding causal relationships in videos through three phases of work, using Gemma 3 and LoRA technology to enable scene graph-to-scene graph causal reasoning.

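Tags: causal reasoning, world model, video understanding, Gemma 3, LoRA, scene graph, counterfactual training, large language model, parameter-efficient fine-tuning, vision-language model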

Published 2026-06-02 01:12 | Recent activity 2026-06-02 01:19 | Estimated read 6 min

Section 01

D-SAT Project Overview: Building AI That Understands Causal Relationships in Videos

Project Source

Author/Maintainer: engineer-nithura
Source Platform: GitHub
Original Title: D-SAT-Phases-1-3-Data-Pipeline-Causal-Model-Training-Counterfactual-Fine-tuning
Link: https://github.com/engineer-nithura/D-SAT-Phases-1-3-Data-Pipeline-Causal-Model-Training-Counterfactual-Fine-tuning
Release/Update Time: 2026-06-01T17:12:09Z

Core Idea

D-SAT (Dynamic Scene-Action Transformer) aims to teach AI to understand 'why' (causal relationships) instead of just 'what' in videos. It builds a causal world model via three phases, using Gemma 3 and LoRA for scene graph-to-scene graph causal reasoning, plus counterfactual training to enhance causal understanding.

Section 02

Project Background & Motivation

Current video understanding models have critical limitations:

Action recognition models identify the action type (e.g., 'cutting') but ignore who performs it, which objects are involved, and how the scene changes.
Scene graph generators capture static spatial relationships but not how they evolve over time.
Vision-language models (VLMs) generate descriptions but lack explicit causal reasoning, so they cannot answer 'what if' questions.

D-SAT's goal is to learn a state transition function: given current scene graph Gₜ and an action, predict next scene graph Gₜ₊₁.
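
To make the transition function concrete, here is a minimal Python sketch. The triple-based graph encoding, the `apply_action` helper, and the onion example are illustrative assumptions, not the project's actual schema. D-SAT learns this mapping with a model; the hand-written rule below only shows the shape of the input and output.

```python
from dataclasses import dataclass

# Hypothetical encoding: a scene graph is a set of (subject, relation, object) triples.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class SceneGraph:
    triples: frozenset[Triple]


def apply_action(graph: SceneGraph, removed: set[Triple], added: set[Triple]) -> SceneGraph:
    """Return G_{t+1} by deleting and adding triples.

    A hand-written stand-in for the learned function f(G_t, action) -> G_{t+1}.
    """
    return SceneGraph(frozenset((graph.triples - removed) | added))


g_t = SceneGraph(frozenset({
    ("onion", "state", "whole"),
    ("onion", "on", "cutting_board"),
    ("knife", "held_by", "person"),
}))

# Effect of the action "cut the onion", written out by hand for illustration.
g_next = apply_action(
    g_t,
    removed={("onion", "state", "whole")},
    added={("onion", "state", "sliced")},
)
```

Only the onion's state changes; the other two relations carry over unchanged, which is the kind of selective update the model is expected to predict.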

Section 03

Technical Architecture Overview

D-SAT has three core components:

Perception Module (Frozen)
- Uses a pre-trained DINOv2 ViT backbone plus a graph generation head to convert video frames into structured JSON scene graphs. This module is not trained.
Causal Transition Model (Trainable)
- The core component: a Gemma 3 model fine-tuned with LoRA, a parameter-efficient method. It takes the current scene graph and the action text as input and outputs the predicted next scene graph. It is trained with cross-entropy loss.
Counterfactual Reasoning Layer
- After base training, the model is fine-tuned on curated counterfactual examples to move it from pattern matching toward causal understanding.
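
Because the transition model is a language model, the scene graph and the action have to be serialized into text. The sketch below shows one way this could look. The prompt template, the `build_prompt` and `parse_prediction` helpers, and the graph fields are assumptions for illustration; the project's real prompt format is not described in the source.

```python
import json
from typing import Optional


def build_prompt(scene_graph: dict, action: str) -> str:
    """Serialize the current scene graph and the action into one text prompt."""
    graph_json = json.dumps(scene_graph, sort_keys=True)
    return (
        "Current scene graph:\n"
        f"{graph_json}\n"
        f"Action: {action}\n"
        "Next scene graph:\n"
    )


def parse_prediction(generated_text: str) -> Optional[dict]:
    """Parse the model output as a JSON object; return None if it is malformed."""
    try:
        parsed = json.loads(generated_text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical scene graph with made-up field names.
current = {"objects": ["egg", "bowl"], "relations": [["egg", "next_to", "bowl"]]}
prompt = build_prompt(current, "crack the egg into the bowl")
```

Returning `None` for malformed output keeps invalid generations from silently entering an evaluation; how the project handles them is not stated.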

Section 04

Phases 1 & 2: Data Pipeline & Model Training

Phase 1: Automated Causal Dataset Generation

Source: YouCook2 dataset (414 videos, 3,180 subtitled clips).
Steps: load annotations → download video clips (yt-dlp) → extract start and end frames (ffmpeg) → generate Gₜ and Gₜ₊₁ with Gemini 2.0 Flash → filter inconsistent triplets → write triplets.jsonl.
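
The last two steps of the pipeline, filtering and writing JSONL, can be sketched as follows. The field names (`g_t`, `action`, `g_t1`) and the consistency rules in `is_consistent` are assumptions made for this example; the source does not say which checks the project applies.

```python
import json


def is_consistent(triplet: dict) -> bool:
    """Hypothetical check: both graphs exist, the action is non-empty,
    and the action actually changes the scene."""
    before, after = triplet.get("g_t"), triplet.get("g_t1")
    if not isinstance(before, dict) or not isinstance(after, dict):
        return False
    if not str(triplet.get("action", "")).strip():
        return False
    return before != after


def to_jsonl(triplets: list[dict]) -> str:
    """Serialize the consistent triplets, one JSON object per line."""
    kept = (t for t in triplets if is_consistent(t))
    return "".join(json.dumps(t, ensure_ascii=False) + "\n" for t in kept)


raw = [
    {"g_t": {"onion": "whole"}, "action": "slice the onion", "g_t1": {"onion": "sliced"}},
    {"g_t": {"onion": "whole"}, "action": "slice the onion", "g_t1": {"onion": "whole"}},  # no change
    {"g_t": {"pan": "empty"}, "action": "", "g_t1": {"pan": "oiled"}},  # missing action
]
jsonl_text = to_jsonl(raw)
```

Writing `jsonl_text` to `triplets.jsonl` with a plain `open(..., "w", encoding="utf-8")` call would complete the step.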

Phase 2: Causal Model Training

Base model: Gemma 3 (2B instruction version).
Training: LoRA fine-tuning with the peft library on an A100 GPU, using cross-entropy loss for sequence prediction.
Evaluation: Graph Edit Distance (GED) on a holdout set.
Output: lora_adapter/ (model checkpoints).
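
As a rough illustration of the evaluation idea, the sketch below computes a simplified edit distance between a predicted and a target scene graph. It assumes graphs are sets of (subject, relation, object) triples and counts one unit per inserted or deleted triple. This is a proxy chosen for brevity, not the project's GED implementation; general graph edit distance also has to search over node correspondences.

```python
Triple = tuple[str, str, str]


def triple_edit_distance(predicted: set[Triple], target: set[Triple]) -> int:
    """Count triples to delete from `predicted` plus triples to insert to reach `target`."""
    return len(predicted - target) + len(target - predicted)


def mean_edit_distance(pairs: list[tuple[set[Triple], set[Triple]]]) -> float:
    """Average the per-example distance over a holdout set of (predicted, target) pairs."""
    if not pairs:
        raise ValueError("holdout set is empty")
    return sum(triple_edit_distance(p, t) for p, t in pairs) / len(pairs)


target = {("onion", "state", "sliced"), ("onion", "on", "cutting_board")}
perfect = set(target)
# A prediction that copied the input graph and missed the state change.
stale = {("onion", "state", "whole"), ("onion", "on", "cutting_board")}

score = mean_edit_distance([(perfect, target), (stale, target)])
```

Lower is better: a perfect prediction scores 0, while the stale prediction costs 2 (one deletion and one insertion).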

Section 05

Phase 3: Counterfactual Fine-tuning (Key Differentiator)

This phase tests whether the model has learned causal structure rather than surface patterns:

Load the best checkpoint from Phase 2.
Fine-tune on curated counterfactual examples (e.g., the same start scene followed by 'add salt' versus 'add sugar' should yield different next scenes).
Evaluation: check both counterfactual accuracy and the original GED, so that gains on counterfactuals do not come at the cost of degraded base performance.
Output: lora_adapter_cf/ (causal-aware model checkpoints).
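
The two-sided evaluation can be sketched as below. The pair format, the scoring rule (both branches must be predicted correctly), the toy predictors, and the `tolerance` value are all assumptions for illustration; the source does not define how counterfactual accuracy is computed.

```python
from typing import Callable

# Hypothetical encoding: a graph is a frozenset of (subject, relation, object) triples.
Graph = frozenset


def counterfactual_accuracy(pairs: list[dict], predict: Callable[[Graph, str], Graph]) -> float:
    """Fraction of pairs where both action branches are predicted correctly."""
    if not pairs:
        raise ValueError("no counterfactual pairs")
    correct = 0
    for pair in pairs:
        pred_a = predict(pair["start"], pair["action_a"])
        pred_b = predict(pair["start"], pair["action_b"])
        if pred_a == pair["target_a"] and pred_b == pair["target_b"]:
            correct += 1
    return correct / len(pairs)


def passes_regression_check(ged_before: float, ged_after: float, tolerance: float = 0.05) -> bool:
    """Accept the fine-tuned adapter only if holdout GED got worse by at most `tolerance`."""
    return ged_after <= ged_before + tolerance


start = frozenset({("soup", "in", "pot")})
pairs = [{
    "start": start,
    "action_a": "add salt", "target_a": start | {("soup", "taste", "salty")},
    "action_b": "add sugar", "target_b": start | {("soup", "taste", "sweet")},
}]


def action_blind(graph: Graph, action: str) -> Graph:
    """Toy predictor that ignores the action entirely."""
    return graph | {("soup", "taste", "salty")}


def action_aware(graph: Graph, action: str) -> Graph:
    """Toy predictor whose output depends on the action."""
    taste = "salty" if "salt" in action else "sweet"
    return graph | {("soup", "taste", taste)}
```

A predictor that ignores the action can get one branch right by memorizing the common outcome, but it cannot get both, which is why scoring per pair exposes pattern matching.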

Section 06

Future Plans for D-SAT

Four more phases are planned to complete the end-to-end system:

Phase 4: Expand the dataset (full YouCook2 plus other video datasets).
Phase 5: Full training and comprehensive evaluation on the expanded data.
Phase 6: Connect the frozen perception module to the causal model for end-to-end video inference.
Phase 7: Build an interactive demo and write the final report.

Section 07

Technical Highlights & Significance

Key Highlights

Combines LLM reasoning (Gemma 3), parameter-efficient fine-tuning (LoRA), and counterfactual training.
Aims to shift AI from pattern recognition toward causal understanding.

Significance

Addresses a core gap in AI: moving beyond correlation to causation.
Could pave the way for more reliable, explainable AI systems.
Raises a central question: what does it mean for AI to 'understand' the world, in terms of deep principles rather than surface patterns?
