Zing Forum

Lever: A Flash-based Speculative Decoding LLM Inference System for Smartphones

This article introduces Lever, a system for efficient flash-resident LLM inference on smartphones. It combines I/O-computation-aware token tree construction, early exit prediction pruning, and CPU-NPU collaborative execution, and runs 2.93x faster than the pure flash-offloaded baseline.

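Tags: mobile LLM inference, speculative decoding, flash optimization, smartphones, on-device AI, CPU-NPU collaboration, I/O-aware scheduling
Published 2026-05-16 11:43 | Recent activity 2026-05-19 10:22 | Estimated read 7 min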

Section 01

Introduction: Lever, a Flash-Resident LLM Inference System for Smartphones

Lever is a flash-resident LLM inference system optimized for smartphones. It addresses the memory bottleneck of on-device LLM inference with three core techniques: I/O-computation-aware token tree construction, early exit prediction pruning, and CPU-NPU collaborative execution. It runs 2.93x faster than the baseline, which makes it practical to run high-quality large models on mobile phones.

Section 02

Dilemmas of Mobile LLM Inference: Memory Bottleneck and the Double-Edged Sword of Flash Memory

Deploying LLMs on mobile devices faces two major challenges:

  1. Memory Bottleneck: Smartphone DRAM (6-12 GB) cannot hold a 7B-parameter model at full precision (about 14 GB of weights in FP16), so the model must be compressed, which degrades output quality;
  2. Flash Limitations: Flash memory has ample capacity but is 2-3 orders of magnitude slower than DRAM, so the frequent flash reads of conventional token-by-token inference cause a severe I/O bottleneck.
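
The memory bottleneck can be checked with simple arithmetic. The sketch below is illustrative only: it counts weight storage alone (no KV cache, activations, or OS overhead), and the function name and the list of precisions are assumptions made for this example, not part of Lever.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS_7B = 7e9  # a 7B-parameter model

# Bytes per parameter for a few common storage precisions.
for name, width in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: {weight_footprint_gb(PARAMS_7B, width):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```

Only the heavily quantized variants fit within 6-12 GB of DRAM, and that DRAM is shared with the operating system and other apps.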

Section 03

Mobile Adaptability of Speculative Decoding and Limitations of Traditional Methods

Speculative decoding is a natural fit for mobile LLM inference: a lightweight draft model (100M-1B parameters) stays in DRAM while the complete target model resides in flash. The draft model generates candidate tokens and the target model verifies them in a batch, so one pass over the flash-resident weights can yield several tokens instead of one, which reduces flash access. However, traditional speculative decoding has limitations on phones:

  • High I/O latency
  • Limited parallelism of mobile NPUs
  • Irregular execution process
  • Difficulty in coordinating heterogeneous computing
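
The draft-then-verify idea described above can be sketched in a few lines. This is a minimal greedy variant for illustration, not Lever's implementation: the two "models" are toy functions invented for this example, and verification is written as a loop where a real system would use a single batched target pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    # Draft phase: the small in-DRAM model proposes k tokens autoregressively.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # Verification phase: compare each drafted token with the target's choice.
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # all accepted: one bonus token
    return accepted


# Toy models (hypothetical): the target counts upward modulo 10,
# the draft agrees except that it guesses wrong right after token 3.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


print(speculative_step([0], toy_draft, toy_target, k=4))  # [1, 2, 3, 4]
```

In this run three drafted tokens are accepted and the fourth is corrected, so a single verification round produces four tokens.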

Section 04

Lever System Architecture: Three Core Optimization Strategies

Lever optimizes three phases of speculative decoding:

  1. Draft Phase: I/O-computation-aware token tree construction. A gain-cost function (maximizing Gain/Cost) prioritizes high-value branches, and the tree's width and depth are adjusted dynamically;
  2. Verification Phase: Early exit prediction pruning. Branch value is evaluated in real time and low-probability branches are terminated early, reducing verification computation by 30-50%;
  3. Execution Phase: CPU-NPU collaborative scheduling. Tasks are partitioned across processors (for example, the draft model on the NPU and token tree construction on the CPU), and three-level pipeline parallelism hides I/O latency.
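
One way to read the Gain/Cost rule in the draft phase is as a best-first expansion under a budget. The sketch below is a guess at the general shape, not Lever's actual algorithm: gain is approximated by the product of draft probabilities along a path, `cost_of` is a hypothetical stand-in for the combined I/O and compute estimate, and `propose`, `budget`, and the toy distribution are all assumptions made for this example.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[int, ...]  # a tree node is identified by its token path from the root


def build_token_tree(
    propose: Callable[[Path], Dict[int, float]],  # draft distribution after a path
    cost_of: Callable[[Path], float],             # assumed verification cost of a node
    budget: float,
) -> List[Path]:
    """Greedily add the node with the highest gain/cost ratio until the budget is spent."""
    frontier: List[Tuple[float, Path, float, float]] = []  # max-heap via negated ratio

    def push(path: Path, prob: float) -> None:
        cost = cost_of(path)
        heapq.heappush(frontier, (-prob / cost, path, prob, cost))

    for tok, p in propose(()).items():
        push((tok,), p)

    tree: List[Path] = []
    spent = 0.0
    while frontier:
        _, path, prob, cost = heapq.heappop(frontier)
        if spent + cost > budget:
            continue  # this node does not fit; a cheaper one still might
        spent += cost
        tree.append(path)
        # Children inherit the path probability, so depth and width emerge
        # from the scores rather than being fixed in advance.
        for tok, p in propose(path).items():
            push(path + (tok,), prob * p)
    return tree


# Toy draft distribution (hypothetical): two candidates per step, depth limit 3.
def toy_propose(path: Path) -> Dict[int, float]:
    return {0: 0.6, 1: 0.3} if len(path) < 3 else {}


print(build_token_tree(toy_propose, lambda path: 1.0, budget=4.0))
# [(0,), (0, 0), (1,), (0, 0, 0)]
```

With unit costs the tree grows deep along the likely branch and adds only one sibling at the root. A cost function that charges more for nodes needing extra flash reads would shift the shape accordingly.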

Section 05

Lever Technical Details: Flash Memory, Quantization, and Memory Management Optimization

Additional Lever technical details:

  • Flash Optimization: Parameters are split into chunks that are loaded on demand, chunks predicted to be needed are prefetched, and transfers are compressed to reduce bandwidth use;
  • Quantization Strategy: The draft model uses INT8, the target model uses FP16/INT8, and key layers keep higher precision;
  • Memory Management: Memory is divided into resident memory (draft model + KV cache), dynamic memory (temporary activations), and a flash cache (LRU-managed parameter chunks).
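
The LRU-managed flash cache can be illustrated with a small sketch. The class name, the capacity measured in chunks, and the `load_from_flash` callback are hypothetical choices for this example; the article does not describe Lever's actual data structures.

```python
from collections import OrderedDict
from typing import Callable


class ChunkCache:
    """Keep the most recently used parameter chunks in DRAM, evicting the oldest."""

    def __init__(self, capacity: int, load_from_flash: Callable[[int], bytes]) -> None:
        self.capacity = capacity          # maximum number of chunks held in DRAM
        self.load = load_from_flash       # assumed slow path that reads one chunk
        self.chunks: "OrderedDict[int, bytes]" = OrderedDict()
        self.flash_reads = 0

    def get(self, chunk_id: int) -> bytes:
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)  # hit: mark as most recently used
            return self.chunks[chunk_id]
        data = self.load(chunk_id)             # miss: pay for a flash read
        self.flash_reads += 1
        self.chunks[chunk_id] = data
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)    # evict the least recently used chunk
        return data


# Stand-in for a flash read; a real loader would read from storage.
cache = ChunkCache(capacity=2, load_from_flash=lambda cid: bytes([cid]) * 4)
for cid in [0, 1, 0, 2, 0, 1]:
    cache.get(cid)
print(cache.flash_reads, list(cache.chunks))  # 4 [0, 1]
```

Six requests trigger four flash reads here. Prefetching, as described above, would issue `get` calls for predicted chunks ahead of time so that the read overlaps with computation.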

Section 06

Experimental Evaluation: Lever's Performance

Experimental results show significant performance gains:

  • Latency Comparison: 2.93x faster than pure flash-offloaded inference, 1.5x faster than traditional speculative decoding, and close to the ideal memory-resident scenario;
  • Key Metrics: Token acceptance rate of 65-75% (versus 45-55% for traditional speculative decoding), I/O read volume reduced by 60%, and energy consumption reduced by 40%;
  • End-to-End Applications: Dialogue assistant response time drops from 8 s to 2.7 s, document summarization runs 2.5x faster, and code completion meets real-time interaction requirements.

Section 07

Limitations and Future Directions

Current Limitations:

  • Maximum model size is 7B parameters;
  • Dependent on NPU architecture, requiring adjustments for specific chips;
  • Significant cold start latency.

Future Directions:

  • Model-system co-design;
  • Edge-cloud collaboration;
  • Personalized adaptation to user devices and usage patterns.

Section 08

Practical Significance and Summary

Practical Significance of Lever:

  • Breaks memory barriers and proves the feasibility of flash-resident inference;
  • Maintains model quality without excessive compression;
  • Promotes the practical application of mobile AI (privacy protection, offline availability, low latency, cost reduction).

Summary: Lever achieves efficient flash-resident LLM inference on mobile phones through its three core technologies, paving the way for wider adoption of edge AI.