Section 01
NVLLM: An Introduction to a New Architecture for Edge Large-Model Inference
NVLLM is a new architecture for edge inference of large models built on 3D NAND flash. Its core innovation is to offload feed-forward network (FFN) computation into the Flash storage itself while keeping attention computation in CMOS logic, allowing 30B-parameter models to run efficiently on edge devices. It delivers a 16-38x speedup over an A800-based baseline and directly targets the memory-bound bottleneck of edge inference.
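To make this compute partition concrete, here is a minimal sketch of how a transformer layer might split its work between the two domains. It is an illustration of the idea only, not NVLLM's actual interface: `FlashFFNUnit` is a hypothetical stand-in for an in-flash compute unit, and the "flash-side" math is simulated on the host with NumPy.

```python
import numpy as np

class FlashFFNUnit:
    """Hypothetical in-flash compute unit (illustrative, not NVLLM's API).

    Models the idea of keeping FFN weight matrices resident in 3D NAND
    and performing the matrix products inside the storage die, so the
    weights never cross the memory bus. Simulated here on the host.
    """

    def __init__(self, w_up, w_down):
        # In real hardware these weights would be programmed into NAND
        # arrays once and read in place, not loaded into DRAM.
        self.w_up = w_up      # (d_model, d_ff)
        self.w_down = w_down  # (d_ff, d_model)

    def forward(self, x):
        # FFN: down_proj(GELU(up_proj(x))), conceptually executed
        # "inside" the flash die in this sketch.
        h = x @ self.w_up
        h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # tanh-approx GELU
        return h @ self.w_down


def attention(x, w_q, w_k, w_v, w_o):
    """Single-head self-attention, kept on the CMOS/logic side."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return (probs @ v) @ w_o


def transformer_layer(x, attn_weights, flash_ffn):
    # Attention stays in logic; the FFN is dispatched to the flash unit.
    x = x + attention(x, *attn_weights)
    x = x + flash_ffn.forward(x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff, seq = 64, 256, 8
    attn_weights = tuple(rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))
    ffn = FlashFFNUnit(rng.standard_normal((d_model, d_ff)) * 0.02,
                       rng.standard_normal((d_ff, d_model)) * 0.02)
    x = rng.standard_normal((seq, d_model))
    print(transformer_layer(x, attn_weights, ffn).shape)  # (8, 64)
```

The split itself is plausible on general transformer grounds: FFN layers hold the bulk of a decoder's weights and so dominate weight traffic, while attention operates on activations and the per-token KV cache, which change every step and benefit from fast logic.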