PALUTE: In-Memory Lookup Table-Based Accelerator Empowers Edge Large Language Model Inference

PALUTE uses monolithic 3D DRAM to implement in-memory lookup table queries, achieving a throughput of 1264 TPS at 0.16 W power consumption. It reports 12.8x higher energy efficiency than CHIME, providing an efficient approach for deploying LLMs on edge devices.

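Tags: Large language models | Edge inference | In-memory computing | Lookup table | 3D DRAM | AI accelerator | Low power | Quantized inference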
Published 2026-06-08 08:33 | Recent activity 2026-06-09 10:52 | Estimated read 5 min

Section 01

Introduction: PALUTE, an In-Memory Lookup Table-Based Edge LLM Inference Accelerator

PALUTE is an in-memory computing accelerator designed for edge large language model (LLM) inference. Its core innovation is using monolithic 3D DRAM (M3D DRAM) to perform lookup table (LUT) queries inside memory. On Qwen3-4B with W4A4 quantization, it reaches a throughput of 1264 TPS at 0.16 W and reports 12.8x higher energy efficiency than CHIME, offering an efficient path to deploying LLMs on edge devices.

Original authors: arXiv authors | Source: arXiv (2026-06-08) | Paper link: http://arxiv.org/abs/2606.08891v1

Section 02

Background: Core Challenges of Edge LLM Inference

Demand for running LLMs on edge devices (e.g., mobile phones, IoT devices) is growing, but deployment faces three key constraints:

  1. Tight power budget (mobile devices are limited to a few watts);
  2. Limited chip area (which affects cost and heat dissipation);
  3. Memory bandwidth bottleneck (bandwidth is far lower than in data center hardware).

Traditional low-bit quantization schemes reduce storage and computation loads but introduce overhead from dequantization and nonlinear operations, creating a new bottleneck.
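
To make the dequantization overhead concrete, here is a minimal sketch of symmetric 4-bit quantization. It is an illustration only: the function names, the per-tensor scale, and the sample weights are assumptions, not details taken from the paper. The point is that every stored 4-bit code needs an extra multiply by the scale before it can be used in ordinary arithmetic.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7].

    Hypothetical helper for illustration; real schemes often use
    per-group scales and zero points.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate real values: one multiply per stored code."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.1, 0.0]          # made-up sample weights
codes, scale = quantize_int4(weights)    # 4-bit codes plus one scale
restored = dequantize(codes, scale)      # the extra step LUT methods try to avoid
```

The storage saving comes from keeping only `codes` and `scale`; the cost is the `dequantize` pass, which grows with the number of weights touched per token.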

Section 03

Method: Architectural Innovations of PALUTE

PALUTE combines LUT methods with M3D DRAM technology. Key designs include:

  1. M3D DRAM Vertical Organization: Uses vertically stacked storage layers to support highly parallel lookups and reduce area overhead;
  2. Near-Memory LUT Generator: Quickly generates LUTs for GEMM/nonlinear operators, with dynamic updates to avoid static table capacity pressure;
  3. System-Level Scheduling: Intelligently predicts access patterns, prefetches data, and minimizes cross-layer data movement.
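
The general idea behind LUT-based low-bit GEMM can be sketched in software as follows. This is not PALUTE's table layout or generator, which this summary does not describe; the 16 x 16 product table and all function names are assumptions chosen for illustration. With 4-bit weights and 4-bit activations there are only 256 possible products, so they can be precomputed once and fetched by index instead of multiplied.

```python
def build_product_lut():
    """Precompute all 16 x 16 products of signed 4-bit codes in [-8, 7]."""
    return [[w * a for a in range(-8, 8)] for w in range(-8, 8)]


def lut_dot(w_codes, a_codes, lut):
    """Dot product of two 4-bit code vectors using lookups and adds only."""
    acc = 0
    for w, a in zip(w_codes, a_codes):
        acc += lut[w + 8][a + 8]   # shift codes from [-8, 7] to indices [0, 15]
    return acc


def lut_matvec(w_rows, a_codes, w_scale, a_scale, lut):
    """Matrix-vector product; scales are applied once per output element."""
    return [lut_dot(row, a_codes, lut) * w_scale * a_scale for row in w_rows]


lut = build_product_lut()
w_codes = [7, -3, 1, 0]     # made-up weight codes
a_codes = [1, 2, -8, 5]     # made-up activation codes
result = lut_dot(w_codes, a_codes, lut)
reference = sum(w * a for w, a in zip(w_codes, a_codes))
```

In this sketch the scale factors are applied once per output rather than once per weight, which is where the saving over per-element dequantization comes from.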

Section 04

Evidence: Performance and Energy Efficiency of PALUTE

Tested on the Qwen3-4B model with W4A4 quantization (4-bit weights and 4-bit activations):

  • Throughput: 1264 TPS;
  • Power consumption: 0.16 W;
  • Energy efficiency comparison: 12.8x higher than CHIME, 1.6x higher than FIGLUT;
  • Area efficiency: 2.0x higher than PIMPAL.
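
The two headline figures can be combined into per-token energy numbers. The short calculation below assumes that TPS means tokens per second and that 0.16 W is the average power during generation; neither is spelled out in this summary, and the variable names are illustrative.

```python
throughput_tps = 1264.0   # tokens per second (assumed meaning of TPS)
power_w = 0.16            # watts, i.e. joules per second

# Energy spent per generated token, and tokens generated per joule.
energy_per_token_j = power_w / throughput_tps
tokens_per_joule = throughput_tps / power_w

print(f"{energy_per_token_j * 1e3:.4f} mJ per token")
print(f"{tokens_per_joule:.0f} tokens per joule")
```

Under those assumptions the result is roughly 0.13 mJ per token, or about 7900 tokens per joule.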

Section 05

Technical Details: In-Memory Computing and LUT Optimization

  1. In-Memory Computing Advantages: Reduces data movement energy consumption (in traditional architectures, moving data costs far more energy than computing on it);
  2. LUT Compression Coding: Uses differential coding, piecewise linear approximation, and adaptive precision to optimize storage efficiency;
  3. Quantization Co-Design: Optimized for W4A4 low-bit scenarios, leveraging the regularity of quantized values.
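
As an illustration of the piecewise linear approximation mentioned above, the sketch below stores a nonlinear function at a small number of breakpoints and interpolates between them. SiLU, the range [-8, 8], and the 32 segments are assumptions picked for the example, not parameters reported for PALUTE.

```python
import math


def silu(x):
    """Reference nonlinear function: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def build_pwl_table(fn, lo, hi, segments):
    """Sample fn at evenly spaced breakpoints; the samples are the table."""
    step = (hi - lo) / segments
    ys = [fn(lo + i * step) for i in range(segments + 1)]
    return lo, step, ys


def pwl_eval(table, x):
    """Evaluate by one table lookup plus linear interpolation."""
    lo, step, ys = table
    segments = len(ys) - 1
    x = max(lo, min(lo + step * segments, x))   # clamp to the table range
    pos = (x - lo) / step
    i = min(int(pos), segments - 1)             # segment index
    frac = pos - i                              # position inside the segment
    return ys[i] + frac * (ys[i + 1] - ys[i])


table = build_pwl_table(silu, -8.0, 8.0, 32)    # 33 stored values
max_err = max(
    abs(pwl_eval(table, x / 100) - silu(x / 100)) for x in range(-800, 801)
)
```

The table holds 33 values instead of one entry per possible input, and the approximation error shrinks as segments are added, which is the storage-versus-accuracy trade-off that adaptive precision would tune.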

Section 06

Application Scenarios: Edge Deployment Directions of PALUTE

Suitable for:

  1. Smartphone on-device AI (offline translation, private document processing);
  2. IoT and edge gateways (industrial quality inspection, intelligent monitoring);
  3. Autonomous driving and robots (real-time perception and decision-making).

Section 07

Limitations and Future Outlook

Current Limitations:

  • Model scale: Only verified on a 4B-parameter model; scalability to larger models needs validation;
  • Versatility: Optimized for Transformers; other network types require adjustments;
  • Process dependency: Deployment depends on the maturity of M3D DRAM manufacturing.

Future Directions:

  • Support 7B/13B models;
  • Multimodal expansion;
  • Dynamic precision adjustment;
  • Improve software stack (compiler, runtime).

Section 08

Conclusion: Edge AI Value of PALUTE

PALUTE combines LUT-based computation with M3D DRAM to address the power-performance trade-off in edge LLM inference, marking an important advance in edge AI accelerators. As the hardware matures and the software stack improves, running large models smoothly on edge devices is likely to become the norm.