# DGX Spark LLM Practical Notes: A Complete Guide to Running Large Models on Desktop AI Supercomputers

> A practical DGX Spark large model deployment note based on real hardware tests, covering detailed configurations and performance benchmarking of inference engines like llama.cpp, vLLM, and Atlas, as well as a trade-off analysis between single-card and dual-card deployment.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-09T10:41:38.000Z
- 最近活动: 2026-06-09T10:52:44.893Z
- 热度: 147.8
- 关键词: DGX Spark, NVIDIA, LLM inference, Step 3.7, vLLM, llama.cpp, Atlas, Blackwell, NVFP4, multi-node, benchmark
- 页面链接: https://www.zingnex.cn/en/forum/thread/dgx-spark-llm-ai
- Canonical: https://www.zingnex.cn/forum/thread/dgx-spark-llm-ai
- Markdown 来源: floors_fallback

---

## DGX Spark LLM Practical Notes Introduction: Guide to Large Model Deployment on Desktop AI Supercomputers

This article is a practical DGX Spark large model deployment note based on real hardware tests. It covers detailed configurations and performance benchmarking of inference engines such as llama.cpp, vLLM, and Atlas, analyzes the trade-offs between single-card and dual-card deployment, and compares the quality performance of different models on this hardware, providing practical references for DGX Spark users and relevant developers.

## Background: DGX Spark Hardware and Project Overview

NVIDIA DGX Spark is a desktop AI supercomputer equipped with the GB10 chip (Blackwell architecture GPU supporting NVLink), 128GB unified memory, and a 20-core ARM CPU. This project is a collection of practical notes from the author's team testing LLMs on real DGX Spark hardware, featuring real tests (including failure cases), rapid iteration, and in-depth detail records (successful configurations and failure reasons).

## Inference Engine Comparison and Configuration Key Points

1. **Atlas** (written in Rust, AI-first design): Native Blackwell support, no Python overhead. The author's team is contributing DGX Spark support (Step3.7 Flash NVFP4 quantization, etc., related PR #119 is in progress);
2. **vLLM**: Upstream does not support multi-node tensor parallelism by default. Need to use StepFun fork and apply patches to implement multi-node NCCL, and configure dual Spark TP=2 with Ray;
3. **llama.cpp/Ollama**: The simplest path to run on a single Spark, supports GGUF format, with simple configuration and good throughput.

## Trade-off Analysis Between Single-Card and Dual-Card Deployment

Comparison for Step3.7 Flash:
| Dimension | Single Spark | Dual Spark |
| --- | --- | --- |
| Engine | llama.cpp (Q4_K_S GGUF) | vLLM (NVFP4, StepFun fork) |
| Throughput | ~27 tok/s | ~18.5 tok/s (RoCE) |
| Context | 96K (stability issues) | 262K |
| Quantization | Q4_K_S | NVFP4 |
| Complexity | Low | High |
Core conclusion: Single card is faster and simpler; dual card unlocks the full 262K context. Physical limitation: NVFP4 model weights (about 121GB) cannot fit into single-card memory, which is the fundamental reason for dual-card deployment.

## Model Quality Comparison Results

Test task: Write a report on the status of DGX Spark local LLM inference in June 2026. Comparison between Step3.7 Flash (198B MoE) and Qwen3.5 122B (MoE):
1. Step3.7 is more in-depth (more searches, sources, contradiction analysis);
2. Qwen3.5 is faster and more concise (6.7x faster, more actionable output);
3. Both have hallucination risks (source URLs not fully verified).

## Practical Value and Application Scenarios

Target audience:
1. Existing DGX Spark users (to avoid pitfalls);
2. Potential buyers (performance data and complexity evaluation);
3. LLM inference optimization researchers (engine/quantization/parallel strategy comparison);
4. MoE deployment engineers (Step3.7/DeepSeek V4 experience).
The value of the notes lies in real trial-and-error records (Docker permissions, NCCL variables, patches, etc.), which are more referenceable for early hardware adopters.
