PALUTE: In-Memory Lookup Table-Based Accelerator Empowers Edge Large Language Model Inference

PALUTE uses monolithic 3D DRAM to perform lookup table queries inside memory, reaching a throughput of 1264 tokens per second (TPS) at 0.16 W. It reports up to 12.8x higher energy efficiency than prior accelerators, offering an efficient route to deploying LLMs on edge devices.

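Tags: large language models | edge inference | in-memory computing | lookup tables | 3D DRAM | AI accelerators | low power | quantized inference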

Published 2026-06-08 08:33 | Recent activity 2026-06-09 10:52 | Estimated read 5 min

Section 01

Introduction: PALUTE, an In-Memory Lookup Table-Based Accelerator for Edge LLM Inference

PALUTE is an in-memory computing accelerator designed for large language model (LLM) inference on edge devices. Its core innovation is using monolithic 3D DRAM (M3D DRAM) to execute lookup table (LUT) queries inside memory. On Qwen3-4B it reaches a throughput of 1264 tokens per second (TPS) at 0.16 W, with 12.8x higher energy efficiency than CHIME and 1.6x higher than FIGLUT, making it a promising approach for deploying LLMs on edge devices.

Original authors: arXiv authors | Source: arXiv (2026-06-08) | Paper link: http://arxiv.org/abs/2606.08891v1

Section 02

Background: Core Challenges of Edge LLM Inference

Demand for running LLMs on edge devices (e.g., mobile phones, IoT devices) is growing, but deployment faces three key constraints:

Tight power budget (mobile devices are typically limited to a few watts);
Limited chip area (which affects cost and heat dissipation);
Memory bandwidth bottleneck (bandwidth is far lower than in data centers).

Traditional low-bit quantization schemes reduce storage and compute requirements, but dequantization and nonlinear operations add overhead that becomes a new bottleneck.
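To make the dequantization overhead concrete, here is a minimal sketch of a conventional 4-bit-weight dot product. The scale/zero-point layout and every name in it are assumptions chosen for illustration; they are not taken from the PALUTE paper.

```python
# Illustrative only: a conventional W4 dot product that dequantizes each
# weight before multiplying. The scale/zero-point scheme and all names are
# assumptions for this sketch, not PALUTE's design.

def dequantize(q, scale, zero_point):
    """Map a 4-bit code (0..15) back to a real value."""
    return scale * (q - zero_point)

def dot_with_dequant(q_weights, scale, zero_point, activations):
    """Return (dot product, number of dequantization operations performed)."""
    acc = 0.0
    dequant_ops = 0
    for q, x in zip(q_weights, activations):
        w = dequantize(q, scale, zero_point)  # extra work on every weight
        dequant_ops += 1
        acc += w * x
    return acc, dequant_ops

q_weights = [0, 15, 8, 3]
activations = [1.0, 2.0, -1.0, 0.5]
result, dequant_ops = dot_with_dequant(q_weights, 0.5, 8, activations)
print(result, dequant_ops)
```

Every multiply-accumulate is preceded by one dequantization step, so the extra work grows with the number of weights. LUT-based designs aim to avoid this per-element conversion.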

Section 03

Method: Architectural Innovations of PALUTE

PALUTE combines LUT methods with M3D DRAM technology. Key designs include:

M3D DRAM Vertical Organization: Vertically stacked storage layers support highly parallel lookups while reducing area overhead;
Near-Memory LUT Generator: Quickly generates LUTs for GEMM and nonlinear operators, updating them dynamically so that static tables do not strain capacity;
System-Level Scheduling: Predicts access patterns, prefetches data, and minimizes cross-layer data movement.
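The general idea of replacing multiplications with table lookups can be sketched as follows. This is a deliberately simple software model for signed 4-bit operands; the table layout and the function names (`build_product_lut`, `lut_dot`, `encode`) are hypothetical and do not describe PALUTE's actual hardware organization.

```python
# Illustrative sketch of LUT-based multiplication for 4-bit operands.
# Table layout and function names are hypothetical, not taken from the paper.

def build_product_lut(bits=4):
    """Precompute every product of two signed `bits`-bit values.

    The table is indexed by the two's-complement codes of the operands.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1))
    size = 1 << bits
    lut = [[0] * size for _ in range(size)]
    for w in range(lo, hi):
        for a in range(lo, hi):
            lut[w & (size - 1)][a & (size - 1)] = w * a
    return lut

def encode(value, bits=4):
    """Two's-complement code of a signed integer in [-8, 7]."""
    return value & ((1 << bits) - 1)

def lut_dot(lut, w_codes, a_codes):
    """Dot product using only table lookups and integer additions."""
    return sum(lut[w][a] for w, a in zip(w_codes, a_codes))

weights = [-8, 7, 3, -1]
acts = [2, -3, 5, 7]
lut = build_product_lut()
lut_result = lut_dot(lut, [encode(w) for w in weights], [encode(a) for a in acts])
reference = sum(w * a for w, a in zip(weights, acts))
print(lut_result, reference)
```

With 4-bit weights and 4-bit activations the full product table has only 16 x 16 entries, which is what makes lookup attractive at this precision. Practical LUT-based GEMM designs typically use more elaborate table indexing than this one-product-per-lookup form.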

Section 04

Evidence: Performance and Energy Efficiency of PALUTE

Tested on the Qwen3-4B model with W4A4 quantization (4-bit weights, 4-bit activations):

Throughput: 1264 tokens per second (TPS);
Power consumption: 0.16 W;
Energy efficiency: 12.8x higher than CHIME and 1.6x higher than FIGLUT;
Area efficiency: 2.0x higher than PIMPAL.
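The throughput and power figures above can be turned into an energy-per-token estimate with simple arithmetic. The sketch below assumes that TPS means tokens per second and that 0.16 W is the average power drawn while sustaining that throughput; the source does not spell out either point.

```python
# Back-of-the-envelope arithmetic from the reported figures.
# Assumptions: TPS = tokens per second, and 0.16 W is the average power
# while sustaining that throughput.

throughput_tps = 1264.0  # tokens per second, as reported
power_w = 0.16           # watts, as reported

tokens_per_joule = throughput_tps / power_w
energy_per_token_mj = power_w / throughput_tps * 1000.0

print(f"{tokens_per_joule:.0f} tokens per joule")
print(f"{energy_per_token_mj:.3f} mJ per token")
```

Under those assumptions the figures work out to roughly 7900 tokens per joule, or about 0.127 mJ per token.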

Section 05

Technical Details: In-Memory Computing and LUT Optimization

In-Memory Computing Advantages: Reduces the energy spent on data movement (in conventional architectures, moving data costs far more energy than computing on it);
LUT Compression Coding: Uses differential coding, piecewise linear approximation, and adaptive precision to store tables more compactly;
Quantization Co-Design: Optimized for W4A4 low-bit scenarios, exploiting the regularity that quantization introduces.
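Two of the compression techniques named above, piecewise linear approximation and differential coding, can be illustrated with a small software sketch. SiLU is used here only as a representative nonlinear operator; the input range, segment count, fixed-point scale, and all function names are assumptions for illustration and are not taken from the paper. Adaptive precision is not modeled.

```python
import math

# Illustrative only: a piecewise linear (PWL) lookup table for a nonlinear
# function, plus differential (delta) coding of the stored table entries.
# Range, segment count, and fixed-point scale are assumptions.

def silu(x):
    return x / (1.0 + math.exp(-x))

def build_pwl_table(fn, lo, hi, segments):
    """Sample fn at evenly spaced breakpoints; returns (table, step)."""
    step = (hi - lo) / segments
    return [fn(lo + i * step) for i in range(segments + 1)], step

def pwl_eval(table, lo, step, x):
    """Linearly interpolate between the two nearest breakpoints (clamped)."""
    hi = lo + step * (len(table) - 1)
    x = min(max(x, lo), hi)
    i = min(int((x - lo) / step), len(table) - 2)
    t = (x - lo) / step - i
    return table[i] + t * (table[i + 1] - table[i])

def delta_encode(values):
    """Store the first value, then successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

table, step = build_pwl_table(silu, -8.0, 8.0, 64)
grid = [-8.0 + k * 0.01 for k in range(1601)]
max_err = max(abs(pwl_eval(table, -8.0, step, x) - silu(x)) for x in grid)

# Quantize entries to fixed point, then delta-code them.
q_table = [round(v * 256) for v in table]
deltas = delta_encode(q_table)
print(f"max PWL error: {max_err:.4f}")
print(f"largest raw entry: {max(abs(v) for v in q_table)}, "
      f"largest delta: {max(abs(d) for d in deltas[1:])}")
```

Because neighboring entries of a smooth function differ only slightly, the deltas span a much smaller range than the raw entries and can be stored in fewer bits, while 65 breakpoints already keep the interpolation error small over this range.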

Section 06

Application Scenarios: Edge Deployment Directions of PALUTE

Suitable for:

On-device AI for smartphones (offline translation, privacy-sensitive document processing);
IoT and edge gateways (industrial quality inspection, intelligent monitoring);
Autonomous driving and robotics (real-time perception and decision-making).

Section 07

Limitations and Future Outlook

Current Limitations:

Model scale: Verified only on a 4B-parameter model; scalability to larger models still needs validation;
Versatility: Optimized for Transformers; other network types would require adjustments;
Process dependency: Deployment depends on the maturity of M3D DRAM manufacturing.

Future Directions:

Support for 7B/13B models;
Multimodal extension;
Dynamic precision adjustment;
A more complete software stack (compiler, runtime).
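To give a sense of what scaling from 4B to 7B/13B models implies for storage, the following sketch estimates raw weight size at 4 bits per weight. It uses nominal parameter counts and ignores quantization scales, the KV cache, and activations, so the numbers are lower bounds rather than figures from the paper.

```python
# Rough weight-storage estimate at 4 bits per weight.
# Nominal parameter counts are assumptions; scales, KV cache, and
# activations are ignored.

def weight_bytes(params, bits_per_weight=4):
    """Raw weight storage in bytes."""
    return params * bits_per_weight // 8

models = [("4B", 4_000_000_000), ("7B", 7_000_000_000), ("13B", 13_000_000_000)]
for name, params in models:
    print(f"{name}: {weight_bytes(params) / 1e9:.1f} GB")
```

This prints 2.0 GB, 3.5 GB, and 6.5 GB respectively, so a 13B model needs more than three times the weight capacity of the 4B model that was evaluated.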

Section 08

Conclusion: Edge AI Value of PALUTE

PALUTE combines LUT-based computation with M3D DRAM to ease the trade-off between power and performance in edge LLM inference, a notable step forward for edge AI accelerators. As the hardware matures and the software stack improves, running large models smoothly on edge devices could become routine.
