
Lever: A Flash-based Speculative Decoding LLM Inference System for Smartphones

This article introduces Lever, a system for efficient flash-resident LLM inference on smartphones. It combines I/O-computation-aware token tree construction, early exit prediction pruning, and CPU-NPU collaborative execution, and runs 2.93x faster than pure flash-offloaded inference.

Mobile LLM inference, speculative decoding, flash optimization, smartphones, on-device AI, CPU-NPU collaboration, I/O-aware scheduling

Published 2026-05-16 11:43 | Recent activity 2026-05-19 10:22 | Estimated read 7 min

Section 01

Introduction: Lever—A Flash-Resident LLM Inference System for Smartphones

Lever is a flash-resident LLM inference system for smartphones built on speculative decoding. It addresses the memory bottleneck of LLM inference on mobile devices with three core techniques: I/O-computation-aware token tree construction, early exit prediction pruning, and CPU-NPU collaborative execution. It runs 2.93x faster than the pure flash-offloaded baseline, which makes it practical to run high-quality large models on mobile phones.

Section 02

Dilemmas of Mobile LLM Inference: Memory Bottleneck and the Double-Edged Sword of Flash Memory

Deploying LLMs on mobile devices faces two major challenges:

Memory Bottleneck: Smartphone DRAM (6-12GB) cannot hold an uncompressed 7B-parameter model, so the model must be compressed, which degrades output quality;
Flash Limitations: Flash memory has ample capacity but is 2-3 orders of magnitude slower than DRAM. Traditional inference reads parameters from flash frequently, which creates a severe I/O bottleneck.
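
The memory bottleneck can be checked with simple arithmetic. The sketch below counts weight storage only (no KV cache, activations, or OS overhead), and the helper name `weight_footprint_gb` is illustrative rather than anything from Lever.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Size of the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# A 7B-parameter model at three common precisions.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B @ {name}: {weight_footprint_gb(7e9, bits):.1f} GB")
```

At FP16 the weights alone come to 14 GB, more than the 6-12GB of DRAM on a typical phone, which is why the model either has to be compressed or has to live in flash.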

Section 03

Mobile Adaptability of Speculative Decoding and Limitations of Traditional Methods

Speculative decoding is a natural fit for mobile LLM inference: a lightweight draft model (100M-1B parameters) stays in DRAM, the complete target model stays in flash, and the draft model generates candidate tokens that the target model verifies in a batch, so flash is read once per batch rather than once per token. However, traditional speculative decoding runs into several limitations on phones:

High I/O latency
Limited parallelism of mobile NPUs
Irregular execution process
Difficulty in coordinating heterogeneous computing
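
To make the draft-then-verify loop concrete, here is a minimal sketch of plain greedy speculative decoding on a linear chain of candidates. It is not Lever's tree-based variant; the toy models and every name in it (`speculative_decode`, `toy_draft`, `toy_target`) are assumptions for illustration.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a context to the next token


def speculative_decode(draft: NextToken, target: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. Draft k candidate tokens autoregressively with the cheap model.
        ctx = list(out)
        candidates = []
        for _ in range(k):
            tok = draft(ctx)
            candidates.append(tok)
            ctx.append(tok)
        # 2. One verification pass: the target scores every candidate prefix.
        #    A real system does this in a single batched forward pass.
        passes += 1
        verified = [target(out + candidates[:i]) for i in range(k + 1)]
        # 3. Accept the longest matching prefix, then append the target's token.
        n = 0
        while n < k and candidates[n] == verified[n]:
            n += 1
        out.extend(candidates[:n])
        out.append(verified[n])
    return out[:len(prompt) + max_new], passes


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 3.
    return 0 if len(ctx) % 3 == 0 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(toy_draft, toy_target, [1], max_new=12)
print(tokens, passes)
```

The output is identical to decoding with the target model alone, but the target is invoked fewer times than the number of generated tokens. When the target lives in flash, each avoided pass is an avoided round of parameter reads.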

Section 04

Lever System Architecture: Three Core Optimization Strategies

Lever optimizes each of the three phases of speculative decoding:

Draft Phase: I/O-computation-aware token tree construction. A gain-cost function (maximizing Gain/Cost) steers exploration toward high-value branches, and the tree's width and depth are adjusted dynamically;
Verification Phase: Early exit prediction pruning. Branch value is evaluated in real time and low-probability branches are terminated early, reducing verification computation by 30-50%;
Execution Phase: CPU-NPU collaborative scheduling. Tasks are partitioned between processors (the draft model on the NPU, token tree management on the CPU, etc.), and a three-level pipeline overlaps them to hide I/O latency.
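
One way to picture the gain-cost idea in the draft phase is a best-first expansion that always adds the candidate with the highest Gain/Cost ratio until a budget runs out. The sketch below is an assumption-laden illustration, not Lever's algorithm: gain is taken to be the path probability under the draft model, `node_cost` is a hypothetical stand-in for combined compute and I/O cost, and `build_token_tree` and `toy_draft` are made-up names.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[int, ...]
# Hypothetical draft interface: token prefix -> {next token: probability}.
DraftFn = Callable[[Path], Dict[int, float]]


def build_token_tree(draft: DraftFn, root: Path, budget: float,
                     node_cost: Callable[[int], float]) -> List[Tuple[Path, float]]:
    """Greedily add the candidate with the best gain/cost ratio within a budget."""
    tree: List[Tuple[Path, float]] = []
    spent = 0.0
    heap: list = []  # entries: (-gain / cost, path, gain, cost)

    def push_children(prefix: Path, prefix_prob: float, depth: int) -> None:
        for tok, p in draft(prefix).items():
            gain = prefix_prob * p   # assumed proxy for acceptance likelihood
            cost = node_cost(depth)  # assumed compute + I/O cost of one node
            heapq.heappush(heap, (-gain / cost, prefix + (tok,), gain, cost))

    push_children(root, 1.0, 1)
    while heap:
        _, path, gain, cost = heapq.heappop(heap)
        if spent + cost > budget:
            break  # stop at the first candidate that no longer fits
        spent += cost
        tree.append((path, gain))
        push_children(path, gain, len(path) - len(root) + 1)
    return tree


def toy_draft(prefix: Path) -> Dict[int, float]:
    # Same distribution at every node, purely for illustration.
    return {0: 0.6, 1: 0.3, 2: 0.1}


tree = build_token_tree(toy_draft, (), budget=4.0, node_cost=lambda depth: 1.0)
print([path for path, _ in tree])
```

With a uniform cost, the tree grows deep along the most likely branch and wide only where a sibling beats a deeper node. Nothing in the code fixes width or depth in advance; both fall out of the ratio.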

Section 05

Lever Technical Details: Flash Memory, Quantization, and Memory Management Optimization

Additional Lever technical details:

Flash Optimization: Parameters are split into chunks and loaded on demand, chunks predicted to be needed are prefetched, and data is transferred in compressed form to reduce bandwidth use;
Quantization Strategy: The draft model uses INT8, the target model uses FP16/INT8, and key layers keep higher precision;
Memory Management: Memory is divided into resident memory (draft model + KV cache), dynamic memory (temporary activations), and a flash cache (LRU-managed parameter chunks).
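
The LRU-managed flash cache with on-demand loading and prefetching can be sketched as follows. This is a minimal illustration under stated assumptions: `ChunkCache`, its `load_from_flash` callback, and the chunk IDs are hypothetical, and a real implementation would issue asynchronous reads rather than the blocking calls used here.

```python
from collections import OrderedDict
from typing import Callable, Iterable


class ChunkCache:
    """DRAM cache of parameter chunks with least-recently-used eviction."""

    def __init__(self, capacity: int, load_from_flash: Callable[[str], bytes]):
        self.capacity = capacity      # maximum number of chunks held in DRAM
        self.load = load_from_flash   # hypothetical flash read callback
        self.chunks: "OrderedDict[str, bytes]" = OrderedDict()
        self.flash_reads = 0

    def get(self, chunk_id: str) -> bytes:
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)  # mark as most recently used
            return self.chunks[chunk_id]
        data = self.load(chunk_id)             # on-demand load on a miss
        self.flash_reads += 1
        self.chunks[chunk_id] = data
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)    # evict the least recently used
        return data

    def prefetch(self, chunk_ids: Iterable[str]) -> None:
        """Warm the cache with chunks predicted to be needed soon."""
        for chunk_id in chunk_ids:
            self.get(chunk_id)


cache = ChunkCache(capacity=2, load_from_flash=lambda cid: cid.encode())
for cid in ["layer0", "layer1", "layer0", "layer2", "layer0"]:
    cache.get(cid)
print(cache.flash_reads, list(cache.chunks))
```

In this run, `layer1` is evicted because `layer0` was touched more recently, so the final access to `layer0` is served from DRAM without another flash read.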

Section 06

Experimental Evaluation: Lever's Performance

The reported experimental results are:

Latency Comparison: 2.93x faster than pure flash-offloaded inference, 1.5x faster than traditional speculative decoding, and close to the ideal memory-resident scenario;
Key Metrics: Token acceptance rate of 65-75% (higher than traditional 45-55%), I/O read volume reduced by 60%, energy consumption decreased by 40%;
End-to-End Applications: Dialogue assistant response time reduced from 8s to 2.7s, document summarization speed increased by 2.5x, and code completion meets real-time interaction requirements.
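
To see why the acceptance rate matters so much, the sketch below uses the standard simplified model of chain speculative decoding, in which each drafted token is accepted independently with probability alpha. This model is an assumption brought in for illustration, not Lever's tree-based analysis, and the draft length of 4 is a hypothetical value.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target pass with gamma drafted tokens.

    Assumes independent acceptance with probability alpha, which gives the
    geometric sum 1 + alpha + ... + alpha**gamma.
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


# Midpoints of the reported ranges: 45-55% (traditional) and 65-75% (Lever).
for label, alpha in [("traditional", 0.50), ("Lever", 0.70)]:
    print(f"{label}: {expected_tokens_per_pass(alpha, gamma=4):.2f} tokens/pass")
```

Under this model, a higher acceptance rate means more tokens per target pass, and therefore fewer passes over the flash-resident target model for the same output length.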

Section 07

Limitations and Future Directions

Current Limitations:

Maximum model size is 7B parameters;
Dependent on NPU architecture, requiring adjustments for specific chips;
Significant cold start latency.

Future Directions:

Model-system co-design;
Edge-cloud collaboration;
Personalized adaptation to user devices and usage patterns.

Section 08

Practical Significance and Summary

Practical Significance of Lever:

Breaks memory barriers and proves the feasibility of flash-resident inference;
Maintains model quality without excessive compression;
Promotes the practical application of mobile AI (privacy protection, offline availability, low latency, cost reduction).

Summary: Lever achieves efficient flash-resident LLM inference on mobile phones through three core technologies, paving the way for the popularization of edge AI.
