NVLLM: A New Architecture for Edge Large Model Inference Based on 3D NAND

NVLLM makes it practical to run 30B-parameter models on edge devices: its architecture offloads FFN computation to Flash storage while keeping attention computation in CMOS logic, delivering a 16-38x speedup over an A800-based baseline.

Edge Computing · Large Model Inference · 3D NAND · Compute-in-Memory · AI Chip · NVLLM · On-Device AI
Published 2026-04-28 22:26 · Recent activity 2026-04-29 11:00 · Estimated read 5 min

Section 01

NVLLM: Introduction to the New Architecture for Edge Large Model Inference

NVLLM is a new architecture for edge large model inference based on 3D NAND. Its core innovation is offloading FFN computation to Flash storage while keeping attention computation in CMOS logic, enabling 30B-parameter models to run efficiently on edge devices. It delivers a 16-38x speedup over an A800-based baseline and targets the memory-bound bottleneck of single-batch edge inference.
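As a rough illustration of this split (a plain-NumPy sketch, not the authors' implementation; `FlashFFN`, `cmos_attention`, and all dimensions are assumptions made for illustration), the toy decoder layer below keeps single-token attention and its DRAM-resident KV cache on the logic side, while the FFN weights never leave the simulated flash-side engine:

```python
import numpy as np

class FlashFFN:
    """Stand-in for the NAND-side engine: the FFN weights are created here and
    never leave this object; only activation vectors cross the boundary."""
    def __init__(self, d_model, d_ff, rng):
        self.w_up = rng.standard_normal((d_model, d_ff)) * 0.02
        self.w_down = rng.standard_normal((d_ff, d_model)) * 0.02

    def execute(self, x):
        # In hardware this would become dot-product primitives over NAND pages.
        return np.maximum(x @ self.w_up, 0.0) @ self.w_down

def cmos_attention(x, kv_cache, wq, wk, wv, wo):
    """Single-token, single-head attention on the logic die, with weights and
    the growing KV cache held in DRAM (heavily simplified)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    kv_cache["k"].append(k)
    kv_cache["v"].append(v)
    K, V = np.stack(kv_cache["k"]), np.stack(kv_cache["v"])
    scores = K @ q / np.sqrt(len(q))
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return (probs @ V) @ wo

def decode_layer(x, kv_cache, attn_weights, flash_ffn):
    x = x + cmos_attention(x, kv_cache, *attn_weights)  # attention: CMOS + DRAM
    x = x + flash_ffn.execute(x)                        # FFN: stays in 3D NAND
    return x

# Example: decode a few tokens through one layer.
rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
attn_weights = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4)]
ffn = FlashFFN(d_model, d_ff, rng)
cache = {"k": [], "v": []}
x = rng.standard_normal(d_model)
for _ in range(3):
    x = decode_layer(x, cache, attn_weights, ffn)
print(x.shape, len(cache["k"]))  # (64,) 3
```

Only the small activation vector crosses the NAND/logic boundary each token, which is the point of the split.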


Section 02

Background and Challenges of Edge Large Model Inference

Large language models face a fundamental obstacle on edge devices: single-batch decoding is memory-bound, since roughly every weight must be read once per generated token. Existing solutions fall short of removing this bottleneck: GPU out-of-core inference is constrained by the overhead of shuttling weights between storage and DRAM, while SSD-based accelerators suffer from inefficient storage access granularity and cannot simultaneously meet power, latency, and throughput requirements.
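A quick back-of-envelope (illustrative numbers, not figures from the paper) makes the memory-bound claim concrete:

```python
# Why batch-1 decoding is memory-bound: roughly every parameter is read once
# per generated token, but contributes only one multiply-accumulate.
params = 30e9                       # a 30B-parameter model
bytes_per_param = 2                 # assuming FP16/BF16 weights
flops_per_token = 2 * params        # ~one multiply-add per parameter

weight_bytes_per_token = params * bytes_per_param        # ~60 GB read per token
arithmetic_intensity = flops_per_token / weight_bytes_per_token
print(f"arithmetic intensity ≈ {arithmetic_intensity:.1f} FLOP/byte")
# ≈ 1 FLOP/byte: an accelerator offering hundreds of FLOPs per byte of memory
# bandwidth sits idle waiting on weight reads at batch size 1.
```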


Section 03

Core Architecture Design of NVLLM

  1. Task Separation: FFN (accounting for over 90% of parameters) is offloaded to 3D NAND for execution, while attention computations are retained in CMOS logic in conjunction with DRAM;
  2. Wafer-level 3D Integration: Multi-layer NAND arrays + on-chip computation pipelines + integrated ECC + dedicated buffer layers, bypassing the DRAM bottleneck;
  3. Dot Product Primitive Execution Engine: PE arrays read NAND data directly, ECC runs in parallel with computation, and out-of-order scheduling maximizes bandwidth (see the sketch after this list);
  4. KV Cache-Aware Scheduler: Attention weights are stored in DRAM, with intelligent prefetching and dynamic adjustment to maintain stable throughput.
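A minimal sketch of the dot-product primitive idea from item 3 (NumPy, under assumptions; `nand_matvec`, its page size, and the `ecc_decode` stub are illustrative, not the paper's microarchitecture): weight columns are split into "pages", pages are consumed in whatever order they arrive, and accumulation keeps the result order-independent.

```python
import numpy as np
import random

def ecc_decode(page):
    # Placeholder for on-die ECC; in the real engine this overlaps with the
    # MAC pipeline instead of running sequentially in front of it.
    return page

def nand_matvec(weight, x, page_cols=4096, seed=0):
    """Matrix-vector product over column blocks ('pages') of a weight matrix,
    consumed out of order; accumulation makes the result order-independent."""
    d_out, d_in = weight.shape
    pages = [(c, weight[:, c:c + page_cols]) for c in range(0, d_in, page_cols)]
    random.Random(seed).shuffle(pages)          # out-of-order page arrival
    acc = np.zeros(d_out, dtype=weight.dtype)
    for col_start, page in pages:
        page = ecc_decode(page)                 # ECC decode (would be pipelined)
        acc += page @ x[col_start:col_start + page.shape[1]]
    return acc

# Sanity check: out-of-order accumulation matches a direct matvec.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16384))
v = rng.standard_normal(16384)
assert np.allclose(nand_matvec(W, v), W @ v)
```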

Section 04

NVLLM Performance Evaluation Results

Evaluated on OPT and LLaMA series models:

  1. Comparison with A800: 16.7-37.9x speedup, attributed to eliminating host weight transfers, computing in storage, and the high density of 3D NAND (a back-of-envelope sketch follows this list);
  2. Comparison with SSD-like designs: Up to 4.7x speedup with only a 2.7% increase in CMOS area overhead, demonstrating the advantages of vertical integration and co-design.
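As a rough sanity check on that attribution (the bandwidth figures below are assumptions chosen for illustration, not numbers reported in the paper), compare streaming the FFN weights over a host link every token against reading them in place inside the NAND die:

```python
ffn_bytes_per_token = 0.9 * 30e9 * 2   # ~90% of a 30B FP16 model, read per token
host_link_bw = 32e9                    # assumed PCIe 4.0 x16-class link, bytes/s
in_storage_bw = 1e12                   # assumed aggregate NAND-array readout, bytes/s

t_host = ffn_bytes_per_token / host_link_bw      # ≈ 1.7 s per token
t_flash = ffn_bytes_per_token / in_storage_bw    # ≈ 0.05 s per token
print(f"host-link bound: {t_host:.2f} s/token; in-storage: {t_flash:.2f} s/token; "
      f"ratio ≈ {t_host / t_flash:.0f}x")
# The exact ratio tracks whatever bandwidths are assumed; the structural point is
# that in-storage execution removes the per-token weight transfer entirely.
```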

Section 05

Technical Significance and Industry Impact of NVLLM

  1. Storage-Compute Fusion: Breaks the von Neumann memory wall;
  2. Edge Deployment: Enables efficient operation of 30B-parameter models on edge devices;
  3. Energy Efficiency: Achieves an order-of-magnitude improvement by reducing data movement;
  4. Commercialization: Mature wafer-level stacking technology paves the way for mass production.

Section 06

Limitations and Future Outlook of NVLLM

Current limitations: the DRAM required for attention computation (chiefly the KV cache) remains a bottleneck; 3D NAND write endurance limits how often model weights can be updated; and adaptation to MoE architectures requires further research. Outlook: NVLLM opens a new path for edge large model inference, and its storage-centric design may influence future AI chip architectures.
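To see why attention's DRAM demand persists as a limitation, here is a KV-cache sizing estimate with OPT-30B-like dimensions (an illustration; the exact footprint depends on model shape, precision, and context length):

```python
layers, heads, head_dim = 48, 56, 128     # OPT-30B-like decoder dimensions
bytes_per_elem = 2                         # FP16 K and V entries
context = 8192                             # tokens held in the cache

kv_bytes = 2 * layers * heads * head_dim * bytes_per_elem * context  # K and V
print(f"KV cache ≈ {kv_bytes / 2**30:.1f} GiB at {context} tokens")   # ≈ 10.5 GiB
# The cache grows linearly with context length, so DRAM capacity and bandwidth,
# not flash, bound long-context decoding on the attention side.
```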