# RAM Coffers: NUMA Distributed Weight Bank Architecture Achieves 8.8x Speedup in CPU-side LLM Inference

> An architecture for IBM POWER8 that uses NUMA-aware conditional memory and resonance routing for O(1) knowledge retrieval, reaching 147 tokens/sec without a GPU, 8.8x faster than standard llama.cpp.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-18T15:45:15.000Z
- Last activity: 2026-05-18T16:22:14.431Z
- Popularity: 161.4
- Keywords: NUMA, LLM inference, CPU optimization, IBM POWER8, memory architecture, weight bank, resonance routing, DeepSeek, DePIN
- Page link: https://www.zingnex.cn/en/forum/thread/ram-coffers-ibm-power8numa
- Canonical: https://www.zingnex.cn/forum/thread/ram-coffers-ibm-power8numa
- Markdown source: floors_fallback

---

## [Introduction] RAM Coffers: An Innovative Architecture for 8.8x Speedup in CPU-side LLM Inference

RAM Coffers is an open-source project for IBM POWER8 servers. Using a NUMA distributed weight bank architecture and resonance routing, it reports CPU-side LLM inference at 147 tokens/sec without a GPU, 8.8x faster than standard llama.cpp. The gain comes from using existing hardware more efficiently, and it points to untapped potential for traditional CPUs in LLM inference.
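Taken together, the two headline figures imply a baseline throughput that the text does not state directly. Assuming both numbers refer to the same model on the same hardware, the standard llama.cpp baseline works out to roughly:

$$
\text{baseline} \approx \frac{147\ \text{tokens/sec}}{8.8} \approx 16.7\ \text{tokens/sec}
$$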

## Background: The Potential of CPU Inference Beyond GPUs

GPUs are the default choice for LLM inference. The RAM Coffers project challenges this assumption by relying entirely on CPU and memory architecture optimizations. Its results suggest that, with a carefully designed memory layout, traditional CPUs can also deliver strong inference performance.

## Core Innovations: NUMA Weight Bank and Resonance Routing Technologies

At the core of RAM Coffers is the NUMA distributed weight bank, which partitions model weights across NUMA nodes by domain into four Coffer regions: core general knowledge, science and technology, creative long context, and niche historical knowledge. Resonance routing matches a query's embedding vector against the Coffers by cosine similarity and sends the query to the best match, which is how the project achieves O(1) knowledge retrieval. Threads are then bound to the target NUMA node to maximize memory locality.
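The routing step described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's implementation: the Coffer names, the 3-dimensional centroid vectors, the node numbers, and the helper names `route`, `parse_cpulist`, and `bind_to_node` are all assumptions made for the example. Only `os.sched_setaffinity` and the Linux sysfs path `/sys/devices/system/node/nodeN/cpulist` are existing interfaces.

```python
import math
import os

# Hypothetical Coffer table: the names follow the four domains in the text;
# the centroid vectors and NUMA node numbers are made up for illustration.
COFFERS = [
    {"name": "core_general", "numa_node": 0, "centroid": [1.0, 0.0, 0.0]},
    {"name": "science_tech", "numa_node": 1, "centroid": [0.0, 1.0, 0.0]},
    {"name": "creative_long_context", "numa_node": 2, "centroid": [0.0, 0.0, 1.0]},
    {"name": "niche_history", "numa_node": 3, "centroid": [0.6, 0.0, 0.8]},
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_embedding, coffers=COFFERS):
    """Return the Coffer whose centroid best matches the query embedding."""
    return max(coffers, key=lambda c: cosine(query_embedding, c["centroid"]))


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def bind_to_node(node):
    """Pin the caller to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))


coffer = route([0.1, 0.9, 0.2])
print(coffer["name"], "-> NUMA node", coffer["numa_node"])
```

One way to read the O(1) claim under this sketch: the number of Coffers is fixed, so the routing cost does not grow with the size of the model or of the stored knowledge. Note that `bind_to_node` only restricts CPU placement. Keeping the weights themselves on the same node depends on the memory allocation policy, which would need a tool such as `numactl` or libnuma and is not shown here.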

## Optimization Details: Non-Bijective Pruning and DCBT Prefetching Strategy

To reduce memory bandwidth requirements, RAM Coffers introduces non-bijective pruning, which selects the key parts of the weights to retain and drops the rest before the weights are loaded. It combines this with the PowerPC DCBT (Data Cache Block Touch) instruction, which prefetches data into the cache, to reduce cache misses and sustain a throughput of 147 tokens/sec.
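The text does not say which criterion decides what counts as a key part. The sketch below substitutes plain magnitude-based top-k selection as a stand-in, so it should be read as an illustration of pruning before load, not as the project's method. The function names `prune_row` and `sparse_dot` are hypothetical. The mapping is many-to-one (every dropped weight collapses to "absent"), which is one plausible reading of "non-bijective".

```python
def prune_row(row, keep):
    """Keep the `keep` largest-magnitude weights of a row as (index, value) pairs.

    Assumption: magnitude is used as the importance criterion; the source
    does not specify one.
    """
    top = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)[:keep]
    return [(i, row[i]) for i in sorted(top)]


def sparse_dot(pruned_row, x):
    """Dot product that reads only the retained weights."""
    return sum(w * x[i] for i, w in pruned_row)


row = [0.02, -1.5, 0.0, 0.7, -0.01, 0.3]
pruned = prune_row(row, keep=3)
print(pruned)  # [(1, -1.5), (3, 0.7), (5, 0.3)]
```

Here half of the row is never read at inference time, which is where the bandwidth saving would come from. The DCBT prefetch step is a machine-level cache hint and has no Python equivalent, so it is left out of the sketch.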

## Related Findings: Coincidence with DeepSeek and Byproduct Technologies

The initial version of RAM Coffers predates the DeepSeek Engram paper by 27 days. The two share similar core ideas (separating static knowledge storage from dynamic computation, and O(1) retrieval), which lends support to the direction. Development also produced two byproducts: PSE hardware entropy, which injects hardware randomness to improve output diversity, and GRAIL-V emotional prompt translation, which reportedly gives a 20%-33% efficiency improvement in video tasks.

## Practical Deployment: DePIN Integration and Economic Return Model

RAM Coffers has been integrated into the physical AI proof technology stack. An IBM POWER8 server can run LLM inference while mining RTC tokens via the ancient proof consensus, which turns it into a DePIN node and gives idle enterprise servers, otherwise a sunk cost, a way to generate returns.

## Technical Limitations and Future Outlook

Technical limitations: the design relies on a NUMA architecture, so it is difficult to reproduce on ordinary consumer hardware, and weight partitioning requires model-specific tuning, so it does not generalize across models.

Future outlook: memory interconnect technologies such as CXL may bring NUMA-like capabilities to ordinary hardware, which would allow these architectural ideas to be applied more widely.
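Whether a given Linux machine exposes more than one NUMA node can be checked from sysfs before attempting to reproduce the setup. The sketch below is an illustration only: the function name `numa_nodes` is hypothetical, and it assumes the standard Linux layout in which each node appears as a `nodeN` directory under `/sys/devices/system/node`.

```python
import os
import re


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return sorted NUMA node ids listed under sysfs, or [] if the directory is absent."""
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return []
    return sorted(
        int(m.group(1)) for e in entries if (m := re.fullmatch(r"node(\d+)", e))
    )


nodes = numa_nodes()
print(f"{len(nodes)} NUMA node(s) visible: {nodes}")
```

A typical consumer desktop reports a single node, in which case partitioning weights across nodes brings no locality benefit.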

## Conclusion: Another Path to Unleash the Potential of Existing Hardware

RAM Coffers is a reminder that, besides chasing larger models and more compute, careful architecture design can unlock the potential of existing hardware. Behind the 147 tokens/sec figure is a willingness to rethink the relationship between computation and storage.