RAM Coffers: NUMA Distributed Weight Bank Architecture Achieves 8.8x Speedup in CPU-side LLM Inference

An architecture for IBM POWER8 that uses NUMA-aware conditional memory and resonance routing for O(1) knowledge retrieval. It reaches 147 tokens/sec without a GPU, which is 8.8x faster than standard llama.cpp.

Tags: NUMA, LLM inference, CPU optimization, IBM POWER8, memory architecture, weight bank, resonance routing, DeepSeek, DePIN

Published 2026-05-18 23:45 | Recent activity 2026-05-19 00:22 | Estimated read 5 min

Section 01

[Introduction] RAM Coffers: An Innovative Architecture for 8.8x Speedup in CPU-side LLM Inference

RAM Coffers is an open-source project for IBM POWER8 servers. By combining a NUMA distributed weight bank with resonance routing, it runs CPU-only LLM inference at 147 tokens/sec without a GPU, 8.8x faster than standard llama.cpp. The gain comes from using existing hardware more efficiently, and it suggests that conventional CPUs have untapped potential for LLM inference.
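As a quick sanity check on the headline numbers, the two reported figures imply a baseline throughput and a per-token latency. These values are derived only from 147 tokens/sec and 8.8x; the article does not state them directly.

```latex
\text{baseline} \approx \frac{147}{8.8} \approx 16.7\ \text{tokens/sec},
\qquad
\text{latency}_{\text{RAM Coffers}} \approx \frac{1000}{147} \approx 6.8\ \text{ms/token},
\qquad
\text{latency}_{\text{baseline}} \approx \frac{1000}{16.7} \approx 59.9\ \text{ms/token}
```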

Section 02

Background: The Potential of CPU Inference Beyond GPUs

In LLM inference, GPUs are the near-universal default. RAM Coffers challenges that assumption by relying entirely on CPU and memory-architecture optimizations. Its results suggest that, with careful memory design, conventional CPUs can also deliver strong inference performance.

Section 03

Core Innovations: NUMA Weight Bank and Resonance Routing Technologies

At the core of RAM Coffers is the NUMA distributed weight bank, which partitions model weights across NUMA nodes by knowledge domain. There are four regions, called Coffers: core general knowledge, science and technology, creative long context, and niche historical knowledge. Resonance routing sends each query to the appropriate Coffer by cosine matching of embedding vectors, which the project describes as O(1) knowledge retrieval. It then binds the worker threads to the target NUMA node to maximize memory locality.
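The routing step can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `COFFERS` table, its 3-dimensional centroid vectors, the node numbering, and the helper names `route`, `parse_cpulist`, and `pin_to_node` are all assumptions made for the example. Only the four domain names and the use of cosine matching come from the article. Because the scan covers a fixed number of Coffers, its cost does not grow with model size, which is one plausible reading of the O(1) claim.

```python
import math
import os

# Hypothetical coffer table: NUMA node id -> (domain label, centroid embedding).
# The domain labels come from the article; the vectors and node ids are made up.
COFFERS = {
    0: ("core general knowledge", [0.9, 0.1, 0.1]),
    1: ("science and technology", [0.1, 0.9, 0.1]),
    2: ("creative long context", [0.1, 0.1, 0.9]),
    3: ("niche historical knowledge", [0.6, 0.6, 0.6]),
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_embedding):
    """Return the NUMA node whose coffer centroid best matches the query."""
    return max(COFFERS, key=lambda node: cosine(query_embedding, COFFERS[node][1]))


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def pin_to_node(node):
    """Best-effort: restrict the caller to the CPUs of `node`. True on success."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    if not hasattr(os, "sched_setaffinity") or not os.path.exists(path):
        return False  # not Linux, or no such NUMA node on this machine
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    if not cpus:
        return False  # memory-only node with no CPUs
    try:
        os.sched_setaffinity(0, cpus)
    except OSError:
        return False
    return True


node = route([0.05, 0.95, 0.10])
print(node, COFFERS[node][0])
```

In a real deployment, `pin_to_node(node)` would be called before the worker touches that Coffer's weights. CPU affinity alone does not move memory; placing the weights on the right node would additionally need a NUMA-aware allocator or a tool such as `numactl`, which this sketch leaves out.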

Section 04

Optimization Details: Non-Bijective Pruning and DCBT Prefetching Strategy

To reduce memory bandwidth requirements, RAM Coffers introduces non-bijective pruning, which selectively prunes weights before they are loaded and retains only the key parts. It combines this with PowerPC's DCBT (Data Cache Block Touch) instruction, which prefetches data into the cache to reduce cache misses and helps sustain the 147 tokens/sec throughput.
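The article does not specify how non-bijective pruning chooses what to keep, so the sketch below substitutes a generic technique, top-k magnitude pruning, purely to show how pruning before load cuts the bytes that must cross the memory bus. The function names `prune_row` and `sparse_dot` and the sample values are hypothetical. The prefetch half is not shown because Python cannot issue cache hints; in C, GCC's `__builtin_prefetch` is the usual way to request one.

```python
def prune_row(weights, keep):
    """Keep the `keep` largest-magnitude entries of one weight row.

    Stand-in for the article's pruning step (the real selection rule is not
    described). Returns (index, value) pairs sorted by index; dropped entries
    are treated as zero.
    """
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = sorted(ranked[:keep])
    return [(i, weights[i]) for i in kept]


def sparse_dot(pruned_row, activations):
    """Dot product that reads only the retained weights."""
    return sum(value * activations[i] for i, value in pruned_row)


row = [0.02, -1.5, 0.3, 0.0, 2.0, -0.01]
pruned = prune_row(row, keep=2)

# Fraction of the row's weights that still has to be streamed from memory.
fraction_loaded = len(pruned) / len(row)
```

Here only two of six weights survive, so a third of the row is read at inference time. The trade-off is approximation error: `sparse_dot` ignores the dropped entries, so the choice of what to keep determines how much accuracy is lost.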

Section 05

Related Findings: Coincidence with DeepSeek and Byproduct Technologies

The initial version of RAM Coffers predates the DeepSeek Engram paper by 27 days, and the two share similar core ideas: separating static knowledge storage from dynamic computation, and O(1) retrieval. The overlap lends support to the direction. Two byproduct technologies also emerged during development: PSE hardware entropy, which injects hardware randomness to improve output diversity, and GRAIL-V emotional prompt translation, which is reported to improve efficiency in video tasks by 20%-33%.

Section 06

Practical Deployment: DePIN Integration and Economic Return Model

RAM Coffers has been integrated into the physical AI proof technology stack. IBM POWER8 servers can run LLM inference while mining RTC tokens through the ancient proof consensus, operating as DePIN nodes. This gives enterprises a way to recover some of the sunk cost of idle servers.

Section 07

Technical Limitations and Future Outlook

Technical limitations: The approach relies on a NUMA architecture, which makes it difficult to reproduce on ordinary consumer hardware. Weight partitioning also requires model-specific tuning, so it does not generalize readily.

Future outlook: Memory interconnect technologies such as CXL may give ordinary hardware NUMA-like capabilities, which would allow these architectural ideas to be applied more widely.

Section 08

Conclusion: Another Path to Unleash the Potential of Existing Hardware

RAM Coffers is a reminder that, besides pursuing ever-larger models and more compute, careful architecture design can unlock the potential of existing hardware. The 147 tokens/sec figure comes from rethinking the relationship between computation and storage.