Zing Forum

DGX Spark LLM Practical Notes: A Complete Guide to Running Large Models on Desktop AI Supercomputers

A practical note on deploying large models on DGX Spark, based on tests on real hardware. It covers detailed configurations and performance benchmarks for the inference engines llama.cpp, vLLM, and Atlas, and analyzes the trade-offs between single-Spark and dual-Spark deployment.

Tags: DGX Spark, NVIDIA, LLM inference, Step 3.7, vLLM, llama.cpp, Atlas, Blackwell, NVFP4, multi-node
Published 2026-06-09 18:41 | Recent activity 2026-06-09 18:52 | Estimated read 5 min

Section 01

DGX Spark LLM Practical Notes Introduction: Guide to Large Model Deployment on Desktop AI Supercomputers

This article is a practical note on deploying large models on DGX Spark, based on tests on real hardware. It covers detailed configurations and performance benchmarks for inference engines such as llama.cpp, vLLM, and Atlas, analyzes the trade-offs between single-Spark and dual-Spark deployment, and compares the output quality of different models on this hardware. It is intended as a practical reference for DGX Spark users and developers working with the platform.


Section 02

Background: DGX Spark Hardware and Project Overview

NVIDIA DGX Spark is a desktop AI supercomputer built around the GB10 chip, which pairs a Blackwell-architecture GPU with a 20-core Arm CPU over NVLink-C2C and provides 128GB of unified memory. This project collects the practical notes of the author's team from testing LLMs on real DGX Spark hardware. Its distinguishing features are real tests (failure cases included), rapid iteration, and detailed records of both the configurations that worked and the reasons others failed.


Section 03

Inference Engine Comparison and Configuration Key Points

  1. Atlas (written in Rust, AI-first design): native Blackwell support and no Python overhead. The author's team is contributing DGX Spark support, including NVFP4 quantization for Step3.7 Flash; the related PR #119 is in progress;
  2. vLLM: according to the notes, upstream does not support multi-node tensor parallelism by default. The team uses the StepFun fork with patches applied to get multi-node NCCL working, and runs two Sparks at TP=2 with Ray;
  3. llama.cpp/Ollama: the simplest path on a single Spark. It supports the GGUF format, needs little configuration, and delivers good throughput.
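
As a rough illustration of the two launch paths above, the sketch below builds the command lines for a single-Spark llama.cpp server and a dual-Spark vLLM server. It is not the team's actual configuration. The model paths are placeholders, the `-ngl` value is an arbitrary "offload everything" choice, and it assumes the StepFun fork accepts the same `vllm serve` flags as upstream vLLM. The NCCL patches and environment variables mentioned in the notes are not reproduced here.

```python
import shlex


def llama_server_cmd(gguf_path, ctx_tokens=96 * 1024, gpu_layers=99):
    """Single-Spark path: llama.cpp's llama-server loading a GGUF file."""
    return [
        "llama-server",
        "-m", gguf_path,           # path to the GGUF weights (placeholder)
        "-c", str(ctx_tokens),     # context window in tokens (96K here)
        "-ngl", str(gpu_layers),   # number of layers to offload to the GPU
    ]


def vllm_serve_cmd(model, tp_size=2):
    """Dual-Spark path: vLLM with tensor parallelism across a Ray cluster.

    ASSUMPTION: the StepFun fork keeps upstream vLLM's CLI flags.
    """
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--distributed-executor-backend", "ray",
    ]


if __name__ == "__main__":
    # Placeholder model names, for illustration only.
    print(shlex.join(llama_server_cmd("step3.7-flash-Q4_K_S.gguf")))
    print(shlex.join(vllm_serve_cmd("path/to/step3.7-flash-nvfp4")))
```

Building the commands as lists keeps each flag next to its value, which makes it easier to record exactly which configuration a benchmark run used.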

Section 04

Trade-off Analysis Between Single-Spark and Dual-Spark Deployment

Comparison for Step3.7 Flash:

Dimension    | Single Spark                  | Dual Spark
Engine       | llama.cpp (Q4_K_S GGUF)       | vLLM (NVFP4, StepFun fork)
Throughput   | ~27 tok/s                     | ~18.5 tok/s (RoCE)
Context      | 96K tokens (stability issues) | 262K tokens
Quantization | Q4_K_S                        | NVFP4
Complexity   | Low                           | High

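Core conclusion: a single Spark is faster and simpler; two Sparks unlock the full 262K context.

Physical limitation: the NVFP4 model weights (about 121GB) do not fit in the memory of a single Spark, because the 128GB of unified memory must also hold the operating system, the inference runtime, and the KV cache. This is the fundamental reason for the dual-Spark deployment.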
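
The memory argument and the throughput gap can be checked with simple arithmetic. The sketch below uses the figures from the comparison (121GB of weights, 128GB per Spark, 27 and 18.5 tok/s). The 16GB reserve for the OS, runtime, and KV cache is an assumed placeholder, not a measured value, and the even split of weights across nodes is a simplification of how tensor parallelism distributes a model.

```python
WEIGHTS_GB = 121.0       # NVFP4 weights, as reported in the notes
NODE_MEMORY_GB = 128.0   # unified memory per DGX Spark
RESERVE_GB = 16.0        # ASSUMPTION: headroom for OS, runtime, and KV cache


def per_node_weights(weights_gb, nodes):
    """Weights per node, assuming an even split (a simplification)."""
    return weights_gb / nodes


def fits(weights_gb, nodes, node_memory_gb=NODE_MEMORY_GB, reserve_gb=RESERVE_GB):
    """True if each node can hold its share of weights plus the reserve."""
    return per_node_weights(weights_gb, nodes) + reserve_gb <= node_memory_gb


print(fits(WEIGHTS_GB, 1))    # False: 121 + 16 = 137 GB exceeds 128 GB
print(fits(WEIGHTS_GB, 2))    # True: 60.5 + 16 = 76.5 GB per node
print(round(27 / 18.5, 2))    # 1.46: single-Spark throughput relative to dual
```

With these numbers the single Spark is roughly 1.46x faster in tokens per second, while the dual setup leaves far more memory per node for context.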

Section 05

Model Quality Comparison Results

Test task: write a report on the state of local LLM inference on DGX Spark as of June 2026. The comparison is between Step3.7 Flash (198B MoE) and Qwen3.5 122B (MoE):

  1. Step3.7 is more in-depth (more searches, sources, contradiction analysis);
  2. Qwen3.5 is faster and more concise (6.7x faster, more actionable output);
  3. Both have hallucination risks (source URLs not fully verified).
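
Because the source URLs in both reports were not fully verified, a first step toward checking them is to pull every cited URL out of the generated text. The helper below is a hypothetical sketch of that step, not part of the original notes. It only extracts and de-duplicates URLs; whether each one resolves and actually supports the claim still has to be checked separately.

```python
import re

# Matches http(s) URLs up to the first whitespace, bracket, or double quote.
URL_RE = re.compile(r'https?://[^\s<>()\[\]"]+')


def extract_urls(report_text):
    """Return unique URLs in order of first appearance.

    Trailing sentence punctuation is stripped so that "https://a.b/c." and
    "https://a.b/c" count as the same source.
    """
    seen = []
    for match in URL_RE.findall(report_text):
        url = match.rstrip(".,;:!?")
        if url not in seen:
            seen.append(url)
    return seen


sample = "See https://example.com/a, and (https://example.org/b). Again https://example.com/a."
print(extract_urls(sample))
```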

Section 06

Practical Value and Application Scenarios

Target audience:

  1. Existing DGX Spark users (to avoid pitfalls);
  2. Potential buyers (performance data and complexity evaluation);
  3. LLM inference optimization researchers (engine/quantization/parallel strategy comparison);
  4. MoE deployment engineers (Step3.7/DeepSeek V4 experience).

The value of the notes lies in their record of real trial and error (Docker permissions, NCCL variables, patches, and so on), which makes them a more useful reference for early adopters of the hardware.