Section 01
LlamaWeb: A WebGPU Solution for Efficiently Running Large Language Models in Browsers
LlamaWeb is a WebGPU-based backend for llama.cpp that runs large language models (LLMs) efficiently in the browser. Techniques such as static memory planning and an adjustable kernel library give it performance portability across devices. Compared with existing solutions, it reduces memory usage by 29-33% and increases decoding throughput by 45-69%, offering a privacy-preserving, efficient, cross-platform option for browser-based AI applications.
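To make the idea of static memory planning concrete, here is a minimal sketch of the general technique, not LlamaWeb's actual planner. It assumes that each intermediate tensor's size and lifetime (first and last graph step at which it is live) are known before inference starts, so every tensor can be given a fixed offset in one shared buffer, and tensors whose lifetimes never overlap can reuse the same bytes. The names `TensorLife`, `Placement`, and `planMemory`, the greedy largest-first ordering, and the example tensors are all hypothetical; the 256-byte alignment is chosen to match WebGPU's default `minStorageBufferOffsetAlignment` limit.

```typescript
// Hypothetical descriptor: tensor size plus the inclusive range of graph
// steps during which the tensor must stay in memory.
interface TensorLife {
  name: string;
  bytes: number;
  firstUse: number;
  lastUse: number;
}

interface Placement {
  name: string;
  offset: number;
  bytes: number;
}

// Round n up to the next multiple of align.
function alignUp(n: number, align: number): number {
  return Math.ceil(n / align) * align;
}

// Greedy static planner: place the largest tensors first, and give each
// tensor the lowest aligned offset that does not collide with any
// already-placed tensor whose lifetime overlaps its own.
function planMemory(
  tensors: TensorLife[],
  align: number = 256
): { placements: Placement[]; totalBytes: number } {
  const order = [...tensors].sort((a, b) => b.bytes - a.bytes);
  const placed: { t: TensorLife; offset: number }[] = [];
  let totalBytes = 0;

  for (const t of order) {
    // Only tensors that are live at the same time can conflict.
    const conflicts = placed
      .filter((p) => p.t.firstUse <= t.lastUse && t.firstUse <= p.t.lastUse)
      .sort((a, b) => a.offset - b.offset);

    // First fit: stop at the first gap that is large enough, otherwise
    // move past the conflicting block.
    let offset = 0;
    for (const c of conflicts) {
      if (offset + t.bytes <= c.offset) break;
      offset = Math.max(offset, alignUp(c.offset + c.t.bytes, align));
    }

    placed.push({ t, offset });
    totalBytes = Math.max(totalBytes, offset + t.bytes);
  }

  return {
    placements: placed.map((p) => ({
      name: p.t.name,
      offset: p.offset,
      bytes: p.t.bytes,
    })),
    totalBytes,
  };
}

// Made-up example: four activations with staggered lifetimes.
const exampleTensors: TensorLife[] = [
  { name: "a", bytes: 1024, firstUse: 0, lastUse: 1 },
  { name: "b", bytes: 512, firstUse: 1, lastUse: 2 },
  { name: "c", bytes: 1024, firstUse: 2, lastUse: 3 },
  { name: "d", bytes: 256, firstUse: 3, lastUse: 4 },
];

const examplePlan = planMemory(exampleTensors);
const naiveBytes = exampleTensors.reduce((sum, t) => sum + t.bytes, 0);
console.log(`planned: ${examplePlan.totalBytes} bytes, naive: ${naiveBytes} bytes`);
```

In this toy case the plan fits in 1536 bytes, whereas giving every tensor its own allocation would take 2816 bytes. Because the whole layout is computed once up front, a runtime built this way could allocate a single GPU buffer and bind sub-ranges of it, rather than creating and destroying buffers during decoding.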