# Zero-TVM: Running Phi-3-mini in Browsers with Handwritten WGSL Shaders, Challenging Compiler Hegemony

> A browser-side LLM inference project that replaces the Apache-TVM compiler stack with just 10 handwritten WGSL kernel roles (~3,000 lines of code). It achieves ~40 tok/s on an M2 Pro, only ~22% slower than WebLLM's auto-tuned version, while keeping the entire GPU compute stack auditable.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-21T14:16:10.000Z
- Last activity: 2026-04-21T14:21:51.557Z
- Popularity: 154.9
- Keywords: WebGPU, WGSL, Phi-3, LLM inference, browser AI, TVM, WebLLM, GPU compute, int4 quantization, open source
- Page link: https://www.zingnex.cn/en/forum/thread/zero-tvm-wgsl-phi-3-mini
- Canonical: https://www.zingnex.cn/forum/thread/zero-tvm-wgsl-phi-3-mini
- Markdown source: floors_fallback

---

## Zero-TVM Project Overview: Handwritten WGSL Shaders for Browser LLM Inference

Zero-TVM is a browser-side LLM inference project that replaces the complex Apache-TVM compiler stack with handwritten WGSL shaders. It uses only 10 kernel roles (27 WGSL files, ~3k lines of code) and ~2k lines of TypeScript to run Phi-3-mini-4k-instruct in browsers. On M2 Pro, it achieves ~40 tok/s—only 22% slower than WebLLM's auto-tuned TVM version—while providing a fully readable, auditable GPU compute stack.

## Background: Complex Compiler Stacks & Project Inspiration

Existing browser LLM solutions like WebLLM/MLC rely on Apache-TVM to generate 85 auto-tuned WGSL kernels plus WASM schedulers. Zero-TVM questions this complexity, inspired by Andrej Karpathy's llm.c (pure C/CUDA GPT-2 training). It aims to bring the same 'minimal, handwritten' philosophy to the browser using WebGPU, int4 quantization, paged KV cache, and modern Transformer architecture.

## Method: Kernel Fusion & Technical Pipeline

Zero-TVM uses active kernel fusion to reduce scheduling overhead:
- **qkv_fused**: Merges Q/K/V projection, RoPE, and KV cache appending into one scheduling step.
- **attention**: Combines paged attention with page table reads.
- **fused_ffn**: Fuses the gate projection, up projection, and SiLU activation.
- **add_norm**: Merges residual connection and RMSNorm.
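The page-table reads fused into the attention kernel boil down to mapping a logical token position onto a physical KV page. A minimal TypeScript sketch of that indexing (`PAGE_SIZE` and the flat physical layout are illustrative assumptions, not the project's actual constants):

```typescript
// vLLM-style paged KV cache addressing, sketched on the CPU side.
const PAGE_SIZE = 16; // tokens per KV page (assumed)

// Maps a logical token position to a physical slot via the page table.
function kvSlot(pageTable: number[], tokenPos: number): number {
  const page = pageTable[Math.floor(tokenPos / PAGE_SIZE)];
  const offset = tokenPos % PAGE_SIZE;
  return page * PAGE_SIZE + offset;
}

// Example: a 32-token sequence scattered across physical pages 7 and 2.
const pageTable = [7, 2];
console.log(kvSlot(pageTable, 0));  // token 0 -> page 7, offset 0 -> slot 112
console.log(kvSlot(pageTable, 17)); // token 17 -> page 2, offset 1 -> slot 33
```

The indirection is what lets the cache grow in fixed-size pages instead of one contiguous buffer, at the cost of one extra table lookup per read.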

Its pipeline includes: 32 Transformer decoder layers, vLLM-style paged KV cache (1.6GB, optional int8 to halve memory), int4 dequantization matmul variants, RoPE (in QKV kernel), greedy decoding, and a handwritten BPE tokenizer (~280 lines TypeScript).
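The RoPE step folded into the QKV kernel is just a per-pair rotation of each head vector. A minimal sketch, assuming the standard base of 10000 and the interleaved pairing convention (HF-style implementations pair vector halves instead; the project's choice is not stated in the post):

```typescript
// Rotary position embedding: rotate consecutive pairs of a head vector
// by a position-dependent angle. Dimension-agnostic on purpose.
function applyRope(head: number[], pos: number, base = 10000): number[] {
  const d = head.length;
  const out = head.slice();
  for (let i = 0; i < d / 2; i++) {
    const theta = pos * Math.pow(base, (-2 * i) / d);
    const x0 = head[2 * i];
    const x1 = head[2 * i + 1];
    out[2 * i] = x0 * Math.cos(theta) - x1 * Math.sin(theta);
    out[2 * i + 1] = x0 * Math.sin(theta) + x1 * Math.cos(theta);
  }
  return out;
}

// At position 0 every angle is zero, so the vector passes through unchanged.
console.log(applyRope([1, 0, 0, 1], 0)); // [1, 0, 0, 1]
```

Because it is a pure rotation it preserves vector norms, which is why it can be fused into the QKV projection without affecting downstream normalization.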

Comparison with WebLLM:
| Metric | WebLLM | Zero-TVM |
|--------|--------|----------|
| Unique WGSL Kernels | 85 | 10 roles / 27 files |
| WGSL Lines | 12,962 (generated) | 3,078 (handwritten) |
| JS Bundle Size (gzip) | 2.1 MB | 33 kB |
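The int4 dequantization matmul variants mentioned in the pipeline hinge on unpacking 4-bit weights from packed 32-bit words before the multiply. A sketch of one common scheme; the low-nibble-first order and the symmetric "nibble − 8" zero point are illustrative assumptions, since the post does not specify the project's exact quantization format:

```typescript
// Unpack eight int4 weights from one packed 32-bit word and apply the
// per-group scale. Assumed layout: low nibble first, symmetric around 8.
function dequantWord(packed: number, scale: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < 8; i++) {
    const nibble = (packed >>> (4 * i)) & 0xf;
    out.push(scale * (nibble - 8));
  }
  return out;
}

// 0x88888888 packs eight nibbles of 8, i.e. eight zeros after dequant.
console.log(dequantWord(0x88888888, 0.5)); // [0, 0, 0, 0, 0, 0, 0, 0]
```

The same unpack-then-scale step runs inside the WGSL matmul kernels; doing it per-word keeps weight memory at roughly one eighth of the f32 footprint.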

## Evidence: Performance Benchmarks & Observations

On an M2 Pro (Chrome 120+), Zero-TVM runs at ~40 tok/s vs WebLLM's 51 tok/s, a ~22% gap. This gap is acceptable because:
1. WebLLM uses auto-tuned kernels for specific hardware, while Zero-TVM's handwritten kernels have no auto-tuning.
2. The gap is smaller than expected—suggesting compiler complexity doesn't translate to proportional performance gains for decoder-only LLMs.
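The 22% figure follows directly from the two throughput numbers:

```typescript
// Relative slowdown of Zero-TVM (~40 tok/s) vs WebLLM (~51 tok/s).
const gap = 1 - 40 / 51;
console.log((gap * 100).toFixed(1) + "%"); // "21.6%", i.e. ~22%
```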

The project documented failed optimizations (e.g., QKV chunking strategies that regressed performance) and speculative decoding CPU simulations with low acceptance rates, showing a 'measure-first' approach.

## Core Value: Fully Auditable Compute Stack

Zero-TVM's biggest value is auditability: every FLOP, GPU buffer, and scheduling step is in human-readable code (WGSL + TypeScript). Use cases include:
- Adding per-layer instrumentation.
- Testing new attention patterns.
- Teaching browser LLM inference fundamentals.

No compiler black boxes or generated code—ideal for research, education, and trusted AI systems.

## Limitations & Transparent Disclosures

Zero-TVM has several limitations:
- **Model Specificity**: Only supports Phi-3-mini-4k-instruct Q4 (hardcoded architecture constants in shaders).
- **GPU Memory**: ~3.6GB (1.8GB weights + 1.6GB KV cache) → OOM on 4GB integrated GPUs.
- **WebGPU**: Requires Chrome/Edge with shader-f16 (Safari not supported).
- **Tokenizer**: Handwritten BPE doesn't fully match HuggingFace's pipeline (issues with emojis/Unicode).
- **Sampling**: Only greedy decoding (no temperature/top-k/top-p).
- **Prefill**: Sequential, one token at a time (batched prefill shaders exist but are not yet integrated).
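The greedy-only sampling limitation means the decode loop is just an argmax over the final logits; a minimal sketch:

```typescript
// Greedy decoding: pick the token id with the highest logit.
// No temperature, top-k, or top-p, matching the limitation above.
function greedyPick(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

console.log(greedyPick(new Float32Array([0.1, 2.5, -1.0, 2.4]))); // 1
```

Adding temperature or top-k sampling would only touch this CPU-side step, not the WGSL kernels, which is presumably why it was deferred.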

## Conclusion: Reimagining AI Infrastructure Complexity

Zero-TVM challenges over-reliance on complex compilers. The 22% performance trade-off is worth it for scenarios needing transparency (education, research, auditable deployments). It proves handwritten WGSL + TypeScript can run Phi-3-mini in browsers effectively—echoing llm.c's message: minimal, readable code can achieve 'good enough' performance for specific workloads.
