Zing Forum


Zero-TVM: Running Phi-3-mini in Browsers with Handwritten WGSL Shaders, Challenging Compiler Hegemony

A browser-side LLM inference project that replaces the Apache-TVM compiler stack with just 10 handwritten WGSL kernels (3000 lines of code). It achieves 40 tok/s on M2 Pro, only 22% slower than WebLLM's auto-tuned version, while enabling a fully auditable GPU compute stack.

Tags: WebGPU, WGSL, Phi-3, LLM inference, browser AI, TVM, WebLLM, GPU compute, int4 quantization, open source
Published 2026-04-21 22:16 · Recent activity 2026-04-21 22:21 · Estimated read 6 min

Section 01

Zero-TVM Project Overview: Handwritten WGSL Shaders for Browser LLM Inference

Zero-TVM is a browser-side LLM inference project that replaces the complex Apache-TVM compiler stack with handwritten WGSL shaders. It uses only 10 kernel roles (27 WGSL files, ~3k lines of code) and ~2k lines of TypeScript to run Phi-3-mini-4k-instruct in browsers. On M2 Pro, it achieves ~40 tok/s—only 22% slower than WebLLM's auto-tuned TVM version—while providing a fully readable, auditable GPU compute stack.


Section 02

Background: Complex Compiler Stacks & Project Inspiration

Existing browser LLM solutions like WebLLM/MLC rely on Apache-TVM to generate 85 auto-tuned WGSL kernels plus WASM schedulers. Zero-TVM questions this complexity, inspired by Andrej Karpathy's llm.c (pure C/CUDA GPT-2 training). It aims to bring the same 'minimal, handwritten' philosophy to the browser using WebGPU, int4 quantization, paged KV cache, and modern Transformer architecture.


Section 03

Method: Kernel Fusion & Technical Pipeline

Zero-TVM uses active kernel fusion to reduce scheduling overhead:

  • qkv_fused: Merges Q/K/V projection, RoPE, and KV cache appending into one scheduling step.
  • attention: Combines paged attention with page table reads.
  • fused_ffn: Fuses the gate projection, up projection, and SiLU activation.
  • add_norm: Merges residual connection and RMSNorm.

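As a CPU-side reference for what the fused add_norm kernel computes, here is a minimal TypeScript sketch of residual-add plus RMSNorm in a single pass (the function name and signature are illustrative, not the project's actual API):

```typescript
// CPU reference for a fused add_norm step: residual connection followed by
// RMSNorm, done in one pass over the hidden vector as a fused kernel would.
function addRmsNorm(
  hidden: Float32Array,   // current layer output
  residual: Float32Array, // residual stream
  weight: Float32Array,   // learned RMSNorm scale
  eps = 1e-5,
): Float32Array {
  const n = hidden.length;
  const out = new Float32Array(n);
  let sumSq = 0;
  for (let i = 0; i < n; i++) {
    const x = hidden[i] + residual[i]; // residual add
    out[i] = x;
    sumSq += x * x;
  }
  const invRms = 1 / Math.sqrt(sumSq / n + eps); // RMS normalizer
  for (let i = 0; i < n; i++) {
    out[i] = out[i] * invRms * weight[i]; // scale by learned weight
  }
  return out;
}
```

Fusing these two steps saves one full read/write of the hidden state per layer, which is where the scheduling-overhead savings come from.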
Its pipeline includes: 32 Transformer decoder layers, vLLM-style paged KV cache (1.6GB, optional int8 to halve memory), int4 dequantization matmul variants, RoPE (in QKV kernel), greedy decoding, and a handwritten BPE tokenizer (~280 lines TypeScript).
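
The exact weight-packing layout is the project's own; as an illustration of the general idea behind an int4-dequantizing matmul, here is a sketch assuming eight 4-bit values packed per 32-bit word with a per-group scale and zero point (an assumed layout, not Zero-TVM's actual format):

```typescript
// Illustrative int4 dequantization: unpack eight 4-bit weights from one
// 32-bit word and rescale with a per-group scale and zero point.
// The packing layout here is an assumption, not Zero-TVM's actual format.
function dequantInt4(packed: number, scale: number, zeroPoint: number): Float32Array {
  const out = new Float32Array(8);
  for (let i = 0; i < 8; i++) {
    const q = (packed >>> (4 * i)) & 0xf; // extract the i-th nibble (0..15)
    out[i] = (q - zeroPoint) * scale;     // dequantize to float
  }
  return out;
}

// A dequantizing dot product, as the inner loop of an int4 matmul would use it.
function dotInt4(
  packedRow: Uint32Array,
  scale: number,
  zeroPoint: number,
  x: Float32Array,
): number {
  let acc = 0;
  for (let w = 0; w < packedRow.length; w++) {
    const vals = dequantInt4(packedRow[w], scale, zeroPoint);
    for (let i = 0; i < 8; i++) acc += vals[i] * x[8 * w + i];
  }
  return acc;
}
```

Dequantizing inside the matmul loop keeps the weights in int4 in GPU memory, which is what makes the 1.8GB weight footprint possible.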

Comparison with WebLLM:

| Metric                | WebLLM             | Zero-TVM            |
| --------------------- | ------------------ | ------------------- |
| Unique WGSL kernels   | 85                 | 10 roles / 27 files |
| WGSL lines            | 12,962 (generated) | 3,078 (handwritten) |
| JS bundle size (gzip) | 2.1MB              | 33kB                |
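
The vLLM-style paged KV cache in the pipeline above maps logical token positions to fixed-size physical pages through a page table. A minimal sketch of that indexing (page size and flat layout are illustrative choices, not Zero-TVM's):

```typescript
// Minimal paged-KV-cache indexing sketch: logical token position -> physical
// slot. PAGE_SIZE and the flat layout are illustrative, not Zero-TVM's values.
const PAGE_SIZE = 16; // tokens per page

class PagedKVCache {
  private pageTable: number[] = []; // logical page index -> physical page id
  private freePages: number[];

  constructor(numPhysicalPages: number) {
    // Simple free list of physical page ids.
    this.freePages = Array.from({ length: numPhysicalPages }, (_, i) => i);
  }

  // Translate a token position into a flat slot in the physical KV buffer,
  // allocating a new page on demand (as appending to the cache would).
  slotFor(pos: number): number {
    const logicalPage = Math.floor(pos / PAGE_SIZE);
    while (this.pageTable.length <= logicalPage) {
      const page = this.freePages.pop();
      if (page === undefined) throw new Error("KV cache out of pages");
      this.pageTable.push(page);
    }
    return this.pageTable[logicalPage] * PAGE_SIZE + (pos % PAGE_SIZE);
  }
}
```

On the GPU side, the attention kernel performs the same page-table read to gather K/V pages, which is why Zero-TVM fuses that lookup into the attention kernel itself.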
4

Section 04

Evidence: Performance Benchmarks & Observations

On M2 Pro (Chrome 120+), Zero-TVM runs at ~40 tok/s vs WebLLM's 51 tok/s (a 22% gap). This gap is acceptable because:

  1. WebLLM uses auto-tuned kernels for specific hardware, while Zero-TVM's handwritten kernels have no auto-tuning.
  2. The gap is smaller than expected—suggesting compiler complexity doesn't translate to proportional performance gains for decoder-only LLMs.

The project documented failed optimizations (e.g., QKV chunking strategies that regressed performance) and CPU-side speculative-decoding simulations with low acceptance rates, reflecting a 'measure-first' approach.


Section 05

Core Value: Fully Auditable Compute Stack

Zero-TVM's biggest value is auditability: every FLOP, GPU buffer, and scheduling step is in human-readable code (WGSL + TypeScript). Use cases include:

  • Adding per-layer instrumentation (timing, activation logging).
  • Testing new attention patterns.
  • Teaching browser LLM inference fundamentals.

No compiler black boxes or generated code—ideal for research, education, and trusted AI systems.
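
Because every scheduling step lives in plain TypeScript, per-layer instrumentation can be as simple as wrapping the layer call. A hypothetical sketch (names and types are illustrative, not the project's API):

```typescript
// Hypothetical per-layer instrumentation: wrap each layer invocation and
// accumulate wall-clock time per layer name. A real WebGPU pipeline would
// use GPU timestamp queries instead; this CPU-side wrapper shows the idea.
type LayerFn = (hidden: Float32Array) => Float32Array;

function instrument(
  name: string,
  layer: LayerFn,
  log: Map<string, number>, // layer name -> accumulated milliseconds
): LayerFn {
  return (hidden) => {
    const t0 = performance.now();
    const out = layer(hidden);
    log.set(name, (log.get(name) ?? 0) + (performance.now() - t0));
    return out;
  };
}
```

With generated code, the equivalent change would mean re-running the compiler pipeline; here it is a three-line wrapper.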


Section 06

Limitations & Transparent Disclosures

Zero-TVM has several limitations:

  • Model Specificity: Only supports Phi-3-mini-4k-instruct Q4 (hardcoded architecture constants in shaders).
  • GPU Memory: ~3.6GB (1.8GB weights + 1.6GB KV cache) → OOM on 4GB integrated GPUs.
  • WebGPU: Requires Chrome/Edge with shader-f16 (Safari not supported).
  • Tokenizer: Handwritten BPE doesn't fully match HuggingFace's pipeline (issues with emojis/Unicode).
  • Sampling: Only greedy decoding (no temperature/top-k/top-p).
  • Pre-filling: Sequential (batch pre-fill shaders exist but not integrated).
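
Greedy decoding, the only sampling strategy currently supported, reduces to an argmax over the final logits; temperature, top-k, and top-p would replace this single step. A minimal sketch:

```typescript
// Greedy decoding: pick the token id with the highest logit. This is the
// entire sampling step when no temperature/top-k/top-p is implemented.
function greedySample(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```
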

Section 07

Conclusion: Reimagining AI Infrastructure Complexity

Zero-TVM challenges over-reliance on complex compilers. The 22% performance trade-off is worth it for scenarios needing transparency (education, research, auditable deployments). It proves handwritten WGSL + TypeScript can run Phi-3-mini in browsers effectively—echoing llm.c's message: minimal, readable code can achieve 'good enough' performance for specific workloads.