Zing Forum

LLM Cluster Simulator: A Browser-Based Distributed GPU Cluster Planning Tool

LLM Cluster Simulator is a browser-based analytical simulator that estimates MFU, memory, throughput, and cost for distributed LLM training and inference without any GPU access. It supports over 70 models and 25 GPU configurations, helping developers make informed parallel-strategy decisions before actual deployment.

Tags: LLM, GPU, distributed training, inference, simulator, parallelism, cluster, MFU, DeepSeek, LLaMA
Published 2026-04-01 16:15 · Recent activity 2026-04-01 16:28 · Estimated read: 7 min

Section 01

LLM Cluster Simulator: Browser-Based Distributed GPU Cluster Planning Tool (Main Guide)

LLM Cluster Simulator is an innovative browser-based analysis tool that lets developers estimate MFU, memory, throughput, and cost for distributed LLM training and inference without any GPU access. It supports over 70 models and 25 GPU configurations, helping users make informed parallel-strategy decisions before actual deployment. Its core value lies in solving complex planning challenges via first-principles physical models.

Section 02

Background: Pain Points in Distributed LLM Planning

Planning distributed LLM training or inference runs into two main obstacles: real-cluster experiments are expensive and require hardware access, and rough estimates break down once complex parallelism (such as pipeline or expert parallelism) or mixed-precision communication is involved. The simulator offers a third option: first-principles calculations (FLOPs, bytes transferred, pipeline bubbles) that run entirely in the browser, with no hardware needed.
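
As an illustration, the first-principles approach can be sketched in a few lines: compute time from FLOPs, communication time from bytes moved, and whichever dominates sets the step time. The function names and hardware numbers below are illustrative assumptions, not the simulator's actual code.

```python
# Minimal sketch of a first-principles step-time estimate.
# All names and constants are illustrative assumptions.

def training_flops_per_token(n_params: float) -> float:
    """Classic ~6 FLOPs per parameter per token (forward + backward),
    ignoring the quadratic attention term."""
    return 6.0 * n_params

def step_time_s(tokens_per_step: float, n_params: float,
                gpus: int, peak_flops_per_gpu: float,
                comm_bytes: float, link_bw_bytes_s: float) -> float:
    """Step time = max(compute, communication), assuming perfect overlap."""
    compute = (tokens_per_step * training_flops_per_token(n_params)
               / (gpus * peak_flops_per_gpu))
    comm = comm_bytes / link_bw_bytes_s
    return max(compute, comm)

# Example: 70B model, 4M-token global batch, 256 GPUs at ~989 TFLOPS peak
t = step_time_s(4e6, 70e9, 256, 989e12, 1e9, 50e9)  # ~6.6 s, compute-bound
```

A real model also needs pipeline bubbles, memory limits, and per-collective latency terms; this only shows the shape of the calculation.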

Section 03

Technical Validation: Alignment with Real-World Training Runs

The simulator's model is calibrated against published real training runs, showing high consistency:

| Model | GPU Configuration | Strategy | Simulated MFU | Actual MFU | Source |
|---|---|---|---|---|---|
| LLaMA 3.1 405B | 16384× H100 | 3D (TP8 PP16) | 41.1% | ~40% | Meta |
| LLaMA 3.1 405B (131K ctx) | 16384× H100 | 3D + CP16 | 37.2% | 38% | Meta |
| DeepSeek V3 671B (FP8) | 2048× H800 | 3D + EP32 | 44.7% | 43.7% | DeepSeek |
| Nemotron-4 340B | 6144× H100 | 3D (TP8 PP12) | 41.2% | 41-42% | NVIDIA |
| OLMo 3 32B | 1024× H100 | FSDP (DP=1024) | 43.4% | ~41% | OLMo 3 |

For long sequences, MFU is computed as model-FLOPs MFU (including the quadratic attention FLOPs), consistent with how the real training runs above report it.
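
Concretely, model-FLOPs MFU can be written in a few lines. This is a hedged sketch: the 6N + 12·L·s·h approximation follows PaLM-style FLOPs accounting, and the model dimensions used in the example are illustrative.

```python
def model_flops_per_token(n_params, n_layers, hidden, seq_len):
    """Dense term (6N) plus quadratic attention term (~12 * L * s * h),
    forward + backward, PaLM-style accounting."""
    return 6 * n_params + 12 * n_layers * seq_len * hidden

def mfu(tokens_per_s, flops_per_token, n_gpus, peak_flops_per_gpu):
    """Achieved model FLOPs divided by aggregate peak FLOPs."""
    return tokens_per_s * flops_per_token / (n_gpus * peak_flops_per_gpu)

# At 131K context on a 405B-class model (126 layers, hidden 16384),
# the quadratic attention term exceeds the 6N term:
f = model_flops_per_token(405e9, 126, 16384, 131072)
```

This is why long-sequence runs report model-FLOPs MFU: dropping the attention term would understate the useful work actually done.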

Section 04

Supported Models & Hardware Configurations

Models: Over 70 models covering architectures (Dense, MoE, MLA, GQA) and families (LLaMA, DeepSeek, Qwen, Mistral, Gemma, Phi, Grok, GLM, OLMo, Kimi, etc.).

GPUs: 25 types from consumer to data center: NVIDIA (A100, H100, B200, RTX 4090), AMD (MI300X), and China-market variants (A800, H800).

Section 05

Simulation Capabilities for Training & Inference

Training Scenarios: Addresses questions like "How many H100s does a 70B model need to train in 30 days?" Features include LoRA/QLoRA, FP8/FP4 mixed precision, selective activation checkpointing, cost prediction (cloud pricing), and an auto-optimizer that searches for the best parallel layout.
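
The "30 days" style of question reduces to a back-of-envelope formula: total training FLOPs (≈ 6·N·D) divided by the effective FLOPs one GPU delivers in the time budget. A minimal sketch, assuming an MFU of 0.40 and an H100 BF16 peak of ~989 TFLOPS (both assumptions, not simulator output):

```python
import math

def gpus_needed(n_params, train_tokens, days, mfu=0.40, peak_flops=989e12):
    """ceil(total training FLOPs / FLOPs one GPU delivers in the budget)."""
    total_flops = 6 * n_params * train_tokens        # 6*N*D approximation
    per_gpu_flops = mfu * peak_flops * days * 86400  # effective, not peak
    return math.ceil(total_flops / per_gpu_flops)

# 70B parameters, 15T tokens, 30 days -> 6144 GPUs under these assumptions
n = gpus_needed(70e9, 15e12, 30)
```

The simulator's value is in refining the two assumed inputs: the achievable MFU depends on the parallel layout, which is exactly what the auto-optimizer searches over.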

Inference Scenarios: Answers questions like "What are the TTFT/TPOT of LLaMA 70B on 8× H100 with speculative decoding?" Features include TTFT/TPOT estimation, speculative decoding, continuous batching, quantization (GGUF, GPTQ, AWQ, INT4/INT8), paged attention (vLLM-style), prefix caching, and disaggregated prefill/decode.
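
Roofline-style first approximations exist for both metrics: prefill (TTFT) is compute-bound, while decode (TPOT) at small batch is bandwidth-bound because every step streams the full weights. A hedged sketch; the efficiency factors and the H100 figures used below (989 TFLOPS, 3.35 TB/s HBM) are assumptions, and KV-cache traffic is ignored.

```python
def ttft_s(n_params, prompt_tokens, n_gpus, peak_flops, eff=0.5):
    """Prefill: ~2 FLOPs per parameter per prompt token (forward only)."""
    return 2 * n_params * prompt_tokens / (n_gpus * peak_flops * eff)

def tpot_s(n_params, bytes_per_param, n_gpus, hbm_bw_bytes_s, eff=0.7):
    """Decode at batch 1: each new token reads all weights from HBM."""
    return n_params * bytes_per_param / (n_gpus * hbm_bw_bytes_s * eff)

# LLaMA-70B-class model in FP16 on 8 GPUs, 2048-token prompt
ttft = ttft_s(70e9, 2048, 8, 989e12)   # ~70 ms
tpot = tpot_s(70e9, 2, 8, 3.35e12)     # ~7.5 ms/token
```

Speculative decoding, quantization, and batching all change these terms, which is what the simulator models on top of this baseline.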

Section 06

Full Parallel Strategy Stack Support

Data Parallel: DDP, ZeRO, FSDP.

Model Parallel: TP (tensor parallel, intra-layer sharding), PP (pipeline parallel, inter-layer sharding with 1F1B, interleaved, DualPipeV scheduling), CP (context parallel, long sequence sharding), SP (sequence parallel), EP (expert parallel for MoE).
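
For PP specifically, the bubble those schedules try to shrink has a simple closed form under plain (non-interleaved) 1F1B: with p stages and m microbatches, the idle fraction is (p-1)/(m+p-1). A minimal sketch; the formula is standard, the example numbers are illustrative.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a non-interleaved 1F1B pipeline schedule."""
    return (stages - 1) / (microbatches + stages - 1)

# PP16 with 64 microbatches idles ~19% of the time; more microbatches
# (or interleaved/DualPipeV-style schedules) shrink the bubble.
f = bubble_fraction(16, 64)
```

This is why deep pipelines need large global batches: the bubble only amortizes as m grows relative to p.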

Section 07

Learning Resources & Client-Side Tech Stack

Learning Resources: Learn Mode (60 structured tasks across 6 paths: training/inference, beginner to advanced) with scenarios, success criteria, and prompts; Space RPG (narrative campaign to learn parallel strategies via story).

Tech Stack: Client-side only (no backend), built with React 19, TypeScript, Vite 7, Tailwind CSS 4, Zustand, and Vitest. All calculations run in the browser and data stays on the user's device (friendly for data-sensitive teams).

Section 08

Limitations, Future Directions & Summary

Limitations: Non-training overheads (checkpointing, data loading, fault recovery) are not modeled; no TPU/Trainium/Inferentia support; no modeling of non-InfiniBand interconnects.

Future Plans: Integrate FlashAttention-3 (fused/custom kernels), NVMe/CPU offload, runtime optimizations, serving frameworks (vLLM/TensorRT), and post-training (RLHF, RLVR, PPO, GRPO).

Summary: The simulator is an innovative tool for distributed LLM planning, offering precise pre-deployment estimates whose results are calibrated against published industry training runs. It is valuable both for teams planning large AI projects and for learners studying distributed training.