Section 01
Introduction: Lever—A Flash-Resident LLM Inference System for Smartphones
This article introduces Lever, a flash-resident LLM inference system optimized for smartphones. Lever addresses the memory bottleneck of LLM inference on mobile devices with three core techniques: I/O-computation-aware token tree construction, early-exit prediction pruning, and CPU-NPU collaborative execution. Compared to baseline methods, it reduces latency by 2.93x, making it practical to run high-quality large models efficiently on mobile phones.