# PALUTE: In-Memory Lookup Table-Based Accelerator Empowers Edge Large Language Model Inference

> PALUTE uses monolithic 3D DRAM to implement in-memory lookup table queries, achieving a throughput of 1264 TPS at 0.16W power consumption. It delivers up to 12.8x higher energy efficiency than prior accelerators, providing an efficient approach for deploying LLMs on edge devices.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-08T00:33:44.000Z
- Last activity: 2026-06-09T02:52:21.997Z
- Popularity: 133.7
- Keywords: large language models, edge inference, in-memory computing, lookup table, 3D DRAM, AI accelerator, low power, quantized inference
- Page link: https://www.zingnex.cn/en/forum/thread/palute
- Canonical: https://www.zingnex.cn/forum/thread/palute
- Markdown source: floors_fallback

---

## [Main Floor/Introduction] PALUTE: In-Memory Lookup Table-Based Edge LLM Inference Accelerator

PALUTE is an in-memory computing accelerator designed for edge large language model (LLM) inference. Its core innovation is using monolithic 3D DRAM (M3D DRAM) to perform lookup table (LUT) queries inside the memory itself. It reaches a throughput of 1264 tokens per second (TPS) at 0.16W power consumption, with up to 12.8x higher energy efficiency than prior accelerators (12.8x over CHIME), making it a practical option for deploying LLMs on edge devices.

Original authors: arXiv authors | Source: arXiv (2026-06-08) | Paper link: http://arxiv.org/abs/2606.08891v1

## Background: Core Challenges of Edge LLM Inference

Demand for running LLMs on edge devices (e.g., mobile phones, IoT devices) is growing, but deployment faces three key constraints:
1. Tight power budget (mobile devices are typically limited to a few watts);
2. Limited chip area (impacting cost and heat dissipation);
3. Memory bandwidth bottleneck (bandwidth is far lower than in data center hardware).

Traditional low-bit quantization schemes reduce storage and computation loads but introduce overhead from dequantization and nonlinear operations, creating a new bottleneck.
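To make the dequantization overhead concrete, here is a minimal sketch of the conventional path for a 4-bit dot product. It assumes a simple affine scheme (unsigned 4-bit codes with one scale and one zero point per tensor); the function names and parameter values are illustrative and are not taken from the paper.

```python
def dequantize(code, scale, zero_point):
    """Convert one unsigned 4-bit code (0..15) back to a real value."""
    return (code - zero_point) * scale


def dequant_dot(w_codes, a_codes, w_scale, w_zp, a_scale, a_zp):
    """Conventional path: dequantize every operand, then multiply-accumulate."""
    total = 0.0
    for w, a in zip(w_codes, a_codes):
        # Two dequantizations and one floating-point multiply per element.
        total += dequantize(w, w_scale, w_zp) * dequantize(a, a_scale, a_zp)
    return total


# Hypothetical codes and quantization parameters, chosen only for illustration.
print(dequant_dot([0, 8, 15], [10, 8, 9], 0.5, 8, 0.25, 8))  # -1.125
```

Every element pays for two conversions and a multiplication before it can be accumulated, which is the per-element cost that lookup-based designs try to avoid.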

## Method: Architectural Innovations of PALUTE

PALUTE combines LUT methods with M3D DRAM technology. Key designs include:
1. **M3D DRAM Vertical Organization**: Uses vertically stacked storage layers to support highly parallel lookups and reduce area overhead;
2. **Near-Memory LUT Generator**: Generates LUTs for GEMM and nonlinear operators on the fly and updates them dynamically, avoiding the capacity pressure of large static tables;
3. **System-Level Scheduling**: Intelligently predicts access patterns, prefetches data, and minimizes cross-layer data movement.
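The general idea behind LUT-based GEMM at W4A4 can be sketched in software: a 4-bit weight and a 4-bit activation have only 16 x 16 = 256 possible products, so they can be precomputed once and fetched by index. The sketch below is a generic illustration under assumed affine quantization; `build_product_lut` and `lut_dot` are hypothetical names, and this is not PALUTE's actual LUT generator or memory layout.

```python
def build_product_lut(w_zp, a_zp):
    """Precompute all 16 x 16 integer products of zero-point-corrected 4-bit codes."""
    return [[(w - w_zp) * (a - a_zp) for a in range(16)] for w in range(16)]


def lut_dot(w_codes, a_codes, lut, w_scale, a_scale):
    """Dot product using table lookups and integer adds only.

    The two scales are applied once to the accumulated sum,
    rather than once per element.
    """
    acc = 0
    for w, a in zip(w_codes, a_codes):
        acc += lut[w][a]  # lookup replaces dequantize + multiply
    return acc * w_scale * a_scale


# Hypothetical codes and quantization parameters, chosen only for illustration.
lut = build_product_lut(w_zp=8, a_zp=8)
print(lut_dot([0, 8, 15], [10, 8, 9], lut, w_scale=0.5, a_scale=0.25))  # -1.125
```

Rebuilding the table when the zero points change is a loose software analogue of generating LUTs dynamically instead of storing one static table for every configuration.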

## Evidence: Performance and Energy Efficiency of PALUTE

Tested on the Qwen3-4B model with W4A4 quantization (4-bit weights, 4-bit activations):
- Throughput: 1264 TPS;
- Power consumption: 0.16W;
- Energy efficiency comparison: 12.8x higher than CHIME, 1.6x higher than FIGLUT;
- Area efficiency: 2.0x higher than PIMPAL.
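The throughput and power figures can be combined into a per-token energy estimate. The short calculation below assumes that TPS means tokens per second and that both figures come from the same measurement; it works out to roughly 7900 tokens per joule, or about 0.127 mJ per token.

```python
throughput_tps = 1264  # tokens per second, as reported above
power_w = 0.16         # watts, as reported above

# Energy efficiency: tokens generated per joule of energy.
tokens_per_joule = throughput_tps / power_w

# Inverse view: energy spent per token, in millijoules.
energy_per_token_mj = power_w / throughput_tps * 1000

print(f"{tokens_per_joule:.0f} tokens/J, {energy_per_token_mj:.3f} mJ/token")
```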

## Technical Details: In-Memory Computing and LUT Optimization

1. **In-Memory Computing Advantages**: Cuts the energy spent on data movement, which in traditional architectures far exceeds the energy spent on computation itself;
2. **LUT Compression Coding**: Uses differential coding, piecewise linear approximation, and adaptive precision to improve storage efficiency;
3. **Quantization Co-Design**: Optimized for W4A4 low-bit scenarios, exploiting the regularity of quantized values.
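Two of the compression ideas named above, piecewise linear approximation and differential coding, can be illustrated with a small generic sketch. It assumes SiLU as the nonlinear operator, a table range of [-8, 8] with 64 segments, and a fixed-point scale of 64; all of these are illustrative choices, not the paper's actual parameters or coding scheme.

```python
import math


def silu(x):
    """Reference nonlinearity used for this illustration."""
    return x / (1.0 + math.exp(-x))


def build_pwl_table(fn, lo, hi, segments):
    """Sample fn at evenly spaced breakpoints (segments + 1 entries)."""
    step = (hi - lo) / segments
    return [fn(lo + i * step) for i in range(segments + 1)], step


def pwl_eval(table, step, lo, x):
    """Linearly interpolate between the two nearest breakpoints.

    Inputs outside the table range are clamped to its ends.
    """
    hi = lo + step * (len(table) - 1)
    x = min(max(x, lo), hi)
    i = min(int((x - lo) / step), len(table) - 2)
    frac = (x - lo) / step - i
    return table[i] + frac * (table[i + 1] - table[i])


def delta_encode(values):
    """Keep the first entry, then store only differences between neighbours."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by running a cumulative sum."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out


table, step = build_pwl_table(silu, -8.0, 8.0, 64)
codes = [round(v * 64) for v in table]  # fixed-point table entries
deltas = delta_encode(codes)
assert delta_decode(deltas) == codes    # differential coding is lossless
```

Because neighbouring entries of a smooth function differ only slightly, the stored differences span a much smaller range than the entries themselves and so need fewer bits each, while interpolation lets a 65-entry table cover the whole input range.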

## Application Scenarios: Edge Deployment Directions of PALUTE

Suitable for:
1. Smartphone on-device AI (offline translation, privacy-sensitive document processing);
2. IoT and edge gateways (industrial quality inspection, intelligent monitoring);
3. Autonomous driving and robots (real-time perception and decision-making).

## Limitations and Future Outlook

**Current Limitations**:
- Model scale: Only verified on 4B parameter models; scalability for larger models needs validation;
- Versatility: Optimized for Transformers; other networks require adjustments;
- Process dependency: M3D DRAM maturity affects deployment.

**Future Directions**:
- Support 7B/13B models;
- Multimodal expansion;
- Dynamic precision adjustment;
- Improve software stack (compiler, runtime).

## Conclusion: Edge AI Value of PALUTE

PALUTE combines LUT-based computation with M3D DRAM to ease the trade-off between power and performance in edge LLM inference, marking an important advance in edge AI accelerators. As the hardware matures and the software stack improves, running large models smoothly on edge devices should become increasingly practical.