Zing Forum

RAM Coffers: NUMA Distributed Weight Bank Architecture Achieves 8.8x Speedup in CPU-side LLM Inference

An architecture for IBM POWER8 that uses NUMA-aware conditional memory and resonance routing for O(1) knowledge retrieval. It reportedly reaches 147 tokens/sec without a GPU, 8.8x faster than standard llama.cpp.

Tags: NUMA, LLM inference, CPU optimization, IBM POWER8, memory architecture, weight bank, resonance routing, DeepSeek, DePIN
Published 2026-05-18 23:45 | Recent activity 2026-05-19 00:22 | Estimated read 5 min

Section 01

[Introduction] RAM Coffers: An Innovative Architecture for 8.8x Speedup in CPU-side LLM Inference

RAM Coffers is an open-source project for IBM POWER8 servers. Using a NUMA-distributed weight bank architecture and resonance routing, it reports CPU-side LLM inference at 147 tokens/sec without a GPU, 8.8x faster than standard llama.cpp. The gain comes from using the machine's existing memory system more efficiently, and it points to untapped potential for traditional CPUs in LLM inference.
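
The two headline figures also imply a baseline that the summary does not state directly. The short calculation below derives it; it assumes both numbers refer to the same model and the same machine, which the text does not confirm.

```python
# Figures quoted in the summary above.
reported_tokens_per_sec = 147.0
reported_speedup = 8.8

# Implied throughput of the stock llama.cpp baseline, assuming the same
# model and hardware were used for both measurements (an assumption).
baseline_tokens_per_sec = reported_tokens_per_sec / reported_speedup

print(f"implied baseline: {baseline_tokens_per_sec:.1f} tokens/sec")
```

In other words, the comparison is against a baseline of roughly 17 tokens/sec, not against GPU inference.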

Section 02

Background: The Potential of CPU Inference Beyond GPUs

GPUs are the de facto standard for LLM inference. The RAM Coffers project challenges that assumption by relying entirely on CPU and memory architecture optimizations. Its results suggest that, with careful memory design, traditional CPUs can deliver far more inference performance than is usually expected of them.

Section 03

Core Innovations: NUMA Weight Bank and Resonance Routing Technologies

The core of RAM Coffers is the NUMA distributed weight bank, which partitions model weights across NUMA nodes by domain. There are four Coffer regions: core general knowledge, science and technology, creative long context, and niche historical knowledge. Resonance routing sends each query to the appropriate Coffer by cosine matching of embedding vectors. Because the number of Coffers is fixed, the routing cost does not grow with the amount of stored knowledge, which is the sense in which retrieval is O(1). The inference threads are then bound to the target NUMA node to maximize memory locality.
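
The routing and binding steps can be illustrated with a minimal sketch. This is not the project's code: the coffer names, the toy 3-dimensional centroids, and the helpers `route`, `parse_cpulist`, and `bind_to_node` are all hypothetical, and the one-Coffer-per-node mapping is an assumption. Only `os.sched_setaffinity` and the Linux sysfs `cpulist` file are real interfaces.

```python
import math
import os

# Hypothetical coffer names, based on the four domains listed above.
COFFERS = ["core_general", "science_tech", "creative_long_context", "niche_historical"]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(query_vec, centroids):
    """Return the index of the coffer whose centroid best matches the query.

    The loop runs once per coffer, so the cost is independent of how many
    weights each coffer holds.
    """
    scores = [cosine(query_vec, c) for c in centroids]
    return max(range(len(scores)), key=scores.__getitem__)

def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def bind_to_node(node):
    """Pin the calling thread to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))

# Toy centroids, one per coffer (made-up values for illustration).
centroids = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
chosen = route([0.1, 0.9, 0.2], centroids)
print(COFFERS[chosen])
# On a NUMA Linux host one would then call bind_to_node(chosen),
# assuming coffer i lives on node i.
```

Note that CPU affinity only controls where the thread runs. Keeping the weights themselves on the same node additionally depends on how the memory was allocated, for example through first-touch placement or a NUMA allocation library.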

Section 04

Optimization Details: Non-Bijective Pruning and DCBT Prefetching Strategy

To reduce memory bandwidth requirements, RAM Coffers introduces non-bijective pruning: before weights are loaded, they are selectively pruned so that only the key parts are retained. This is combined with PowerPC's dcbt (Data Cache Block Touch) instruction, which prefetches data into the cache ahead of use. Together they reduce cache misses and sustain the reported throughput of 147 tokens/sec.
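
The summary does not say how the pruning step decides which weights to keep. As a stand-in only, the sketch below uses plain magnitude pruning: keep the largest-magnitude fraction of a weight row and zero the rest. Zeroing is a many-to-one mapping (the original values cannot be recovered), which is one possible reading of "non-bijective", but the function name `prune_top_fraction`, the `keep_fraction` parameter, and the criterion itself are assumptions, not the project's method.

```python
def prune_top_fraction(weights, keep_fraction):
    """Zero all but the largest-magnitude weights.

    Returns (pruned_weights, kept_indices). Stand-in criterion only:
    the actual RAM Coffers selection rule is not described in the source.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, round(len(weights) * keep_fraction))
    # Indices ordered by descending absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = sorted(order[:keep])
    kept_set = set(kept)
    pruned = [w if i in kept_set else 0.0 for i, w in enumerate(weights)]
    return pruned, kept

row = [0.02, -1.5, 0.3, 0.0, 2.1, -0.04]
pruned, kept = prune_top_fraction(row, 0.5)
print(kept)
print(pruned)
```

Only the kept indices and their values would need to be read from memory at inference time, which is where the bandwidth saving would come from under this assumption.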

Section 05

Related Findings: Coincidence with DeepSeek and Byproduct Technologies

The initial version of RAM Coffers predates the DeepSeek Engram paper by 27 days. The two share similar core ideas (separating static knowledge storage from dynamic computation, and O(1) retrieval), which the project takes as independent support for the direction. Two byproducts also emerged during development: PSE hardware entropy, which injects hardware randomness to improve output diversity, and GRAIL-V emotional prompt translation, which is reported to give a 20%-33% efficiency improvement in video tasks.

Section 06

Practical Deployment: DePIN Integration and Economic Return Model

RAM Coffers has been integrated into a proof-of-physical-AI technology stack. An IBM POWER8 server can run LLM inference while mining RTC tokens under a proof-of-antiquity consensus, acting as a DePIN node. This offers a way to recover some of the sunk cost of idle enterprise servers.

Section 07

Technical Limitations and Future Outlook

Technical Limitations: The approach relies on a NUMA architecture, so it is difficult to reproduce on ordinary consumer hardware. Weight partitioning also requires model-specific tuning, which limits generality. Future Outlook: Memory interconnect technologies such as CXL may give ordinary hardware NUMA-like capabilities, in which case these architectural ideas could be applied more widely.

Section 08

Conclusion: Another Path to Unleash the Potential of Existing Hardware

RAM Coffers is a reminder that, alongside the pursuit of larger models and more computing power, careful architecture design can unlock the potential of existing hardware. The 147 tokens/sec figure comes from rethinking the relationship between computation and storage.