# dt-agent: A Multimodal Agent-Based Automatic Digital Twin Construction System

> A proof-of-concept project demonstrating how large language models and vision-language models can collaborate to automatically convert text specifications into actionable digital twin scenes in NVIDIA Isaac Sim, via a closed loop of planning → editing → execution → observation → reflection.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-13T20:15:25.000Z
- Last activity: 2026-05-13T20:20:21.604Z
- Heat: 157.9
- Keywords: digital twin, Isaac Sim, multimodal agent, vision-language model, 3D simulation, robotics, USD
- Page link: https://www.zingnex.cn/en/forum/thread/dt-agent
- Canonical: https://www.zingnex.cn/forum/thread/dt-agent

---

## dt-agent: Guide to the Multimodal Agent-Based Automatic Digital Twin Construction System

dt-agent is a proof-of-concept project demonstrating how large language models and vision-language models can collaborate to automatically convert text specifications into actionable digital twin scenes in NVIDIA Isaac Sim, through a closed-loop process of planning → editing → execution → observation → reflection. The core goal of the project is to lower the barrier to digital twin technology, enabling non-specialists to generate complex 3D simulation scenes from natural language descriptions.

## Project Background and Vision

Digital twin technology is playing an increasingly important role in industrial simulation, robot training, and automated testing. However, building complex scenes requires professional 3D modeling knowledge and tedious manual configuration. dt-agent proposes a new approach: using multimodal large model agents to automatically generate actionable digital twin scenes via natural language descriptions, significantly lowering the barrier to use.

## Core Architecture Design and Technical Implementation Details

### Core Architecture Design
- **Agent Loop Mechanism**: The system iterates through a closed loop of planning → editing → execution → observation → reflection, adjusting and refining each pass based on the previous results.
- **Dual-Model Collaboration**: The planner and coder use GPT-5.3-codex (via the NVIDIA Inference Agent) for high-level planning and code generation; the observer uses a self-hosted Cosmos Reason 2 8B vision-language model to interpret rendered scene images.
- **Simulation Environment**: The underlying layer is built on NVIDIA Isaac Sim 5.1.0, with interaction over a standard-library HTTP RPC interface to avoid dependency conflicts.
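The closed loop described above can be sketched as a small driver function. This is a minimal illustration only; the function names, the `intent_satisfied` field, and the `viewport` key are assumptions for the sketch, not the project's actual API.

```python
# Hypothetical sketch of the planning → editing → execution → observation →
# reflection loop. The role-specific callables (plan, edit, execute, observe,
# reflect) stand in for the LLM, the Isaac Sim RPC client, and the VLM.

def run_agent_loop(spec, plan, edit, execute, observe, reflect, max_iters=5):
    """Iterate until the observer judges that the scene satisfies the spec."""
    history = []
    for _ in range(max_iters):
        step_plan = plan(spec, history)      # LLM: high-level plan
        code = edit(step_plan)               # LLM: scene-editing code
        result = execute(code)               # simulation server: run the code
        image = result.get("viewport")       # rendered frame for the observer
        verdict = observe(spec, image)       # VLM: structured critique
        history.append({"plan": step_plan, "verdict": verdict})
        if verdict.get("intent_satisfied"):
            return history                   # done: intent met
        spec = reflect(spec, verdict)        # fold feedback into the next pass
    return history                           # give up after max_iters
```

The key design point is that the observer's structured verdict feeds back into the next planning pass, so each iteration starts from the critique of the last.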

### Technical Implementation Details
- **Scene Construction**: Supports creating geometric primitives, adding USD references, setting transforms, saving scenes, and searching for preset assets on the NVIDIA OpenUSD CDN.
- **Visual Observation and Verification**: Cosmos Reason 2 8B receives rendered images and reports whether the intent is satisfied, along with observations, identified problems, and correction suggestions.
- **Tool Call Interfaces**: HTTP POST tools are provided for scene querying, USD operations, transform setting, asset search, viewport capture, and more, each returning structured JSON results.

## Typical Application Scenarios

### Industrial Workbench Simulation
The demo case builds an industrial scene containing a workbench, a UR10e robotic arm, a conveyor belt, and a microplate. The agent automatically plans the structure, generates USD code, executes the construction, and verifies that the layout is correct.

### Rapid Prototype Verification
Gives engineers and designers a way to generate interactive 3D scenes without manual modeling. Although the output still requires manual refinement, it significantly accelerates early design iteration.

## Deployment and Usage Process

### Environment Preparation
Deployment is containerized: Isaac Sim and the VLM service run via Docker Compose. The first run must pull large container images and model weights.
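A Compose file for such a two-service setup might look roughly like the following. This is an illustrative sketch only: the service names, image references, and port numbers are assumptions, not the project's actual compose file.

```yaml
# Hypothetical docker-compose.yml sketch for a two-service deployment.
services:
  isaac-sim:
    image: nvcr.io/nvidia/isaac-sim:5.1.0    # large image; first pull is slow
    ports:
      - "8011:8011"                          # assumed HTTP RPC server port
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  cosmos-reason:
    image: example.registry/cosmos-reason-2-8b  # placeholder; real NIM image
    ports:                                      # requires an NGC API key
      - "8000:8000"
```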

### Configuration Process
Two credentials must be configured: an NVIDIA API key (for accessing the inference service) and an NGC API key (for pulling NIM images). The project provides an environment variable template and configuration instructions.
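An environment variable template for the two credentials might look like this. The variable names here are assumptions based on the description above, not the project's actual configuration keys.

```shell
# Illustrative .env template (variable names are hypothetical).
NVIDIA_API_KEY=nvapi-xxxxxxxx   # access to the hosted inference service
NGC_API_KEY=xxxxxxxx            # used to authenticate image pulls from NGC
```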

### Running Examples
First run the inference agent verification script, then test the simulation client connection, execute the preset workbench script, and finally run the complete agent loop. Each step comes with an example script and expected output.

## Technical Highlights and Innovations

- **Pure Standard Library Implementation**: The RPC server inside the Isaac Sim container uses only the Python standard library, avoiding conflicts with Kit's bundled packages and keeping the server stable and portable.
- **Modular Architecture**: The simulation server, VLM wrapper, and agent main loop are separate components, making them easy to test, debug, and extend independently.
- **Reproducible Tracing**: Each run generates a JSONL trace file recording the tool call history and model responses, for analysis, debugging, and replay.

## Limitations and Future Development Directions

As a proof of concept, dt-agent currently demonstrates technical feasibility above all. The success rate on complex scene construction, the accuracy of visual observation, and the robustness of generated code all still need improvement. Future directions include supporting more complex asset operations, introducing physical constraint verification, and integrating other digital twin platforms.

## Project Significance and Value

dt-agent is a worthwhile exploration of AI-assisted 3D content generation, showing that collaborating multimodal large models can translate natural language intent into simulation scenes. For researchers and engineers working on robot simulation, industrial digital twins, and automated testing, it offers a valuable reference implementation and a useful set of technical ideas.
