Zing Forum


Zero-TVM: Running Phi-3-mini in Browsers with Handwritten WGSL Shaders, Challenging Compiler Hegemony

A browser-side LLM inference project that replaces the Apache-TVM compiler stack with just 10 handwritten WGSL kernels (3000 lines of code). It achieves 40 tok/s on M2 Pro, only 22% slower than WebLLM's auto-tuned version, while enabling a fully auditable GPU compute stack.

Tags: WebGPU, WGSL, Phi-3, LLM inference, browser AI, TVM, WebLLM, GPU compute, int4 quantization, open source
Published 2026-04-21 22:16 · Recent activity 2026-04-21 22:21 · Estimated read 6 min

Section 01

Zero-TVM Project Overview: Handwritten WGSL Shaders for Browser LLM Inference

Zero-TVM is a browser-side LLM inference project that replaces the complex Apache-TVM compiler stack with handwritten WGSL shaders. It uses only 10 kernel roles (27 WGSL files, ~3k lines of code) and ~2k lines of TypeScript to run Phi-3-mini-4k-instruct in browsers. On M2 Pro, it achieves ~40 tok/s—only 22% slower than WebLLM's auto-tuned TVM version—while providing a fully readable, auditable GPU compute stack.


Section 02

Background: Complex Compiler Stacks & Project Inspiration

Existing browser LLM solutions like WebLLM/MLC rely on Apache-TVM to generate 85 auto-tuned WGSL kernels plus WASM schedulers. Zero-TVM questions this complexity, inspired by Andrej Karpathy's llm.c (pure C/CUDA GPT-2 training). It aims to bring the same 'minimal, handwritten' philosophy to the browser using WebGPU, int4 quantization, paged KV cache, and modern Transformer architecture.


Section 03

Method: Kernel Fusion & Technical Pipeline

Zero-TVM uses active kernel fusion to reduce scheduling overhead:

  • qkv_fused: Merges Q/K/V projection, RoPE, and KV cache appending into one scheduling step.
  • attention: Combines paged attention with page table reads.
  • fused_ffn: Fuses the gate projection, up projection, and SiLU activation.
  • add_norm: Merges residual connection and RMSNorm.

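As a CPU-side reference for what the fused add_norm kernel computes, here is a minimal TypeScript sketch of residual-add plus RMSNorm in a single pass (the function name and signature are illustrative, not the project's actual API):

```typescript
// CPU reference for a fused add_norm step: residual connection followed by
// RMSNorm, done in one pass over the hidden vector as a fused kernel would.
function addRmsNorm(
  hidden: Float32Array,   // current layer output
  residual: Float32Array, // residual stream
  weight: Float32Array,   // learned RMSNorm scale
  eps = 1e-5,
): Float32Array {
  const n = hidden.length;
  const out = new Float32Array(n);
  let sumSq = 0;
  for (let i = 0; i < n; i++) {
    const x = hidden[i] + residual[i]; // residual add
    out[i] = x;
    sumSq += x * x;
  }
  const invRms = 1 / Math.sqrt(sumSq / n + eps); // RMS normalizer
  for (let i = 0; i < n; i++) {
    out[i] = out[i] * invRms * weight[i]; // scale by learned weight
  }
  return out;
}
```

Fusing these two steps saves one full read/write of the hidden state per layer, which is where the scheduling-overhead savings come from.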
Its pipeline includes: 32 Transformer decoder layers, vLLM-style paged KV cache (1.6GB, optional int8 to halve memory), int4 dequantization matmul variants, RoPE (in QKV kernel), greedy decoding, and a handwritten BPE tokenizer (~280 lines TypeScript).
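
The exact weight-packing layout is the project's own; as an illustration of the general idea behind an int4-dequantizing matmul, here is a sketch assuming eight 4-bit values packed per 32-bit word with a per-group scale and zero point (an assumed layout, not Zero-TVM's actual format):

```typescript
// Illustrative int4 dequantization: unpack eight 4-bit weights from one
// 32-bit word and rescale with a per-group scale and zero point.
// The packing layout here is an assumption, not Zero-TVM's actual format.
function dequantInt4(packed: number, scale: number, zeroPoint: number): Float32Array {
  const out = new Float32Array(8);
  for (let i = 0; i < 8; i++) {
    const q = (packed >>> (4 * i)) & 0xf; // extract the i-th nibble (0..15)
    out[i] = (q - zeroPoint) * scale;     // dequantize to float
  }
  return out;
}

// A dequantizing dot product, as the inner loop of an int4 matmul would use it.
function dotInt4(
  packedRow: Uint32Array,
  scale: number,
  zeroPoint: number,
  x: Float32Array,
): number {
  let acc = 0;
  for (let w = 0; w < packedRow.length; w++) {
    const vals = dequantInt4(packedRow[w], scale, zeroPoint);
    for (let i = 0; i < 8; i++) acc += vals[i] * x[8 * w + i];
  }
  return acc;
}
```

Dequantizing inside the matmul loop keeps the weights in int4 in GPU memory, which is what makes the 1.8GB weight footprint possible.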

Comparison with WebLLM:

| Metric                | WebLLM             | Zero-TVM            |
| --------------------- | ------------------ | ------------------- |
| Unique WGSL kernels   | 85                 | 10 roles / 27 files |
| WGSL lines            | 12,962 (generated) | 3,078 (handwritten) |
| JS bundle size (gzip) | 2.1MB              | 33kB                |
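
The vLLM-style paged KV cache in the pipeline above maps logical token positions to fixed-size physical pages through a page table. A minimal sketch of that indexing (page size and flat layout are illustrative choices, not Zero-TVM's):

```typescript
// Minimal paged-KV-cache indexing sketch: logical token position -> physical
// slot. PAGE_SIZE and the flat layout are illustrative, not Zero-TVM's values.
const PAGE_SIZE = 16; // tokens per page

class PagedKVCache {
  private pageTable: number[] = []; // logical page index -> physical page id
  private freePages: number[];

  constructor(numPhysicalPages: number) {
    // Simple free list of physical page ids.
    this.freePages = Array.from({ length: numPhysicalPages }, (_, i) => i);
  }

  // Translate a token position into a flat slot in the physical KV buffer,
  // allocating a new page on demand (as appending to the cache would).
  slotFor(pos: number): number {
    const logicalPage = Math.floor(pos / PAGE_SIZE);
    while (this.pageTable.length <= logicalPage) {
      const page = this.freePages.pop();
      if (page === undefined) throw new Error("KV cache out of pages");
      this.pageTable.push(page);
    }
    return this.pageTable[logicalPage] * PAGE_SIZE + (pos % PAGE_SIZE);
  }
}
```

On the GPU side, the attention kernel performs the same page-table read to gather K/V pages, which is why Zero-TVM fuses that lookup into the attention kernel itself.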
4

Section 04

Evidence: Performance Benchmarks & Observations

On M2 Pro (Chrome 120+), Zero-TVM runs at ~40 tok/s vs WebLLM's 51 tok/s (a 22% gap). This gap is acceptable because:

  1. WebLLM uses auto-tuned kernels for specific hardware, while Zero-TVM's handwritten kernels have no auto-tuning.
  2. The gap is smaller than expected—suggesting compiler complexity doesn't translate to proportional performance gains for decoder-only LLMs.

The project documented failed optimizations (e.g., QKV chunking strategies that regressed performance) and CPU-side speculative-decoding simulations with low acceptance rates, reflecting a 'measure-first' approach.


Section 05

Core Value: Fully Auditable Compute Stack

Zero-TVM's biggest value is auditability: every FLOP, GPU buffer, and scheduling step is in human-readable code (WGSL + TypeScript). Use cases include:

  • Adding per-layer instrumentation (timing, activation logging).
  • Testing new attention patterns.
  • Teaching browser LLM inference fundamentals.

No compiler black boxes or generated code—ideal for research, education, and trusted AI systems.
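
Because every scheduling step lives in plain TypeScript, per-layer instrumentation can be as simple as wrapping the layer call. A hypothetical sketch (names and types are illustrative, not the project's API):

```typescript
// Hypothetical per-layer instrumentation: wrap each layer invocation and
// accumulate wall-clock time per layer name. A real WebGPU pipeline would
// use GPU timestamp queries instead; this CPU-side wrapper shows the idea.
type LayerFn = (hidden: Float32Array) => Float32Array;

function instrument(
  name: string,
  layer: LayerFn,
  log: Map<string, number>, // layer name -> accumulated milliseconds
): LayerFn {
  return (hidden) => {
    const t0 = performance.now();
    const out = layer(hidden);
    log.set(name, (log.get(name) ?? 0) + (performance.now() - t0));
    return out;
  };
}
```

With generated code, the equivalent change would mean re-running the compiler pipeline; here it is a three-line wrapper.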


Section 06

Limitations & Transparent Disclosures

Zero-TVM has several limitations:

  • Model Specificity: Only supports Phi-3-mini-4k-instruct Q4 (hardcoded architecture constants in shaders).
  • GPU Memory: ~3.6GB (1.8GB weights + 1.6GB KV cache) → OOM on 4GB integrated GPUs.
  • WebGPU: Requires Chrome/Edge with shader-f16 (Safari not supported).
  • Tokenizer: Handwritten BPE doesn't fully match HuggingFace's pipeline (issues with emojis/Unicode).
  • Sampling: Only greedy decoding (no temperature/top-k/top-p).
  • Pre-filling: Sequential (batch pre-fill shaders exist but not integrated).
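
Greedy decoding, the only sampling strategy currently supported, reduces to an argmax over the final logits; temperature, top-k, and top-p would replace this single step. A minimal sketch:

```typescript
// Greedy decoding: pick the token id with the highest logit. This is the
// entire sampling step when no temperature/top-k/top-p is implemented.
function greedySample(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```
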

Section 07

Conclusion: Reimagining AI Infrastructure Complexity

Zero-TVM challenges over-reliance on complex compilers. The 22% performance trade-off is worth it for scenarios needing transparency (education, research, auditable deployments). It proves handwritten WGSL + TypeScript can run Phi-3-mini in browsers effectively—echoing llm.c's message: minimal, readable code can achieve 'good enough' performance for specific workloads.