# Exploring the Limits of Framework Desktop Inference: Practical Large Model Optimization on the Strix Halo Platform

> A months-long in-depth research project optimizing large-model inference with llama.cpp RPC on the AMD Strix Halo platform (Framework Desktop) paired with an RTX 3090. It completed 34 tasks covering KV cache compression, prefix caching, Flash Attention, mixed-precision quantization, NPU experiments, and heterogeneous RPC inference.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-20T09:45:39.000Z
- Last activity: 2026-04-20T09:52:48.134Z
- Heat: 145.9
- Keywords: Strix Halo, Framework Desktop, LLM inference, llama.cpp, RPC, heterogeneous computing, KV cache, speculative decoding, AMD, quantization optimization
- Page link: https://www.zingnex.cn/en/forum/thread/framework-desktop-strix-halo
- Canonical: https://www.zingnex.cn/forum/thread/framework-desktop-strix-halo
- Markdown source: floors_fallback

---

## [Introduction] Exploring the Limits of Framework Desktop Large Model Inference: Practical Optimization on the Strix Halo Platform

This research project focuses on the Framework Desktop platform with AMD Strix Halo architecture, pairing it with an RTX 3090 to optimize large-model inference via llama.cpp RPC. Its 34 completed tasks cover KV cache compression, speculative decoding, and heterogeneous RPC inference, probing the limits of desktop-class LLM inference and challenging the traditional reliance on data center GPUs.

## [Research Background and Test Environment]

### Research Background

As LLM parameter counts grow, inference efficiency has become the main deployment bottleneck, and deployment has traditionally depended on expensive data center GPUs. The Framework Desktop with AMD Strix Halo architecture (Ryzen AI Max+ 395, Radeon 8060S iGPU, 128GB unified memory) offers an ideal platform for desktop-class inference.

### Test Environment
- **Main Node**: Framework Desktop (Ryzen AI Max+ 395, Radeon 8060S, 128GB LPDDR5X, Vulkan/ROCm backend)
- **Companion Node**: RTX 3090 (24GB GDDR6X, CUDA 12.8)
- **Software Stack**: llama.cpp (b8775/b8779), RPC over Wi-Fi
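
This two-node topology can be sketched as follows. The paths, IP address, and port are placeholders; `GGML_RPC`, `rpc-server`, and `--rpc` are upstream llama.cpp's RPC mechanism, but exact flag spellings should be checked against the build in use (b8775/b8779 here).

```shell
# On the RTX 3090 companion node: build llama.cpp with RPC support and
# expose the CUDA backend over the network (IP and port are placeholders).
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON
cmake --build build --config Release
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the Framework Desktop main node: point the local build at the
# remote worker; layers are split between local and RPC backends.
./build/bin/llama-cli -m models/model.gguf \
    --rpc 192.168.1.42:50052 \
    -ngl 99 -p "Hello"
```

The project ran this link over Wi-Fi; a wired connection should reduce the RPC round-trip latency noted later in the limitations section.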

## [Core Optimization Methods and Technical Exploration]

### Key Task Exploration
1. **KV Cache**: Tested 14 Pareto-optimal configurations to balance context length and speed
2. **Speculative Decoding**: Used a 0.8B draft model to accelerate the 122B target model, increasing decoding speed by 1.98x
3. **Parallel Throughput**: Aggregate throughput increased by 2.21x with 8 parallel sequences (npl=8)
4. **Comprehensive Optimization**: Q4_K_M quantization + ubatch=2048 + parallel slots achieved an aggregate throughput of 60.54 tok/s
5. **Thermal Sustainability**: Throughput drift was only -0.08% after 60 minutes of operation
6. **Heterogeneous RPC**: Split the Qwen3.5-122B model across AMD + NVIDIA GPUs, with only a 4.3% decrease in decoding speed
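
Several of the techniques above can be combined in a single server launch. The sketch below is a hypothetical invocation, not the project's actual command line: file names are placeholders, and flag names follow recent llama.cpp builds (verify with `llama-server --help` for your version).

```shell
# -m: Q4_K_M-quantized target weights; -md: small draft model for
# speculative decoding; --draft-max: draft tokens per verification step;
# -ngl: offload all layers; -np: 8 parallel slots; -ub: micro-batch size.
# All model paths are placeholders.
./build/bin/llama-server \
    -m models/target-q4_k_m.gguf \
    -md models/draft-0.8b.gguf \
    --draft-max 16 \
    -ngl 99 -np 8 -ub 2048 \
    --host 127.0.0.1 --port 8080
```
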

### Technical Depth
- Unified Memory Architecture: Shared 128GB memory supports larger models and zero-copy transfer
- rocWMMA Flash Attention: Reduces memory bandwidth requirements
- Mixed-Precision Quantization: Established a trade-off curve between quantization levels and quality
- NPU Experiments: Explored the potential of Neural Processing Units (NPUs) in LLM inference
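
As a worked example of why KV precision trades off against context length: the cache grows linearly with context, layer count, and bytes per element. The model dimensions below are illustrative assumptions, not the actual Qwen3.5-122B configuration.

```shell
# Back-of-envelope KV cache sizing. Dimensions are illustrative
# assumptions (48 layers, 8 KV heads of dim 128), not a real model config.
N_LAYERS=48; N_KV_HEADS=8; HEAD_DIM=128
N_CTX=131072        # 131K-token context
BYTES_F16=2         # f16 cache: 2 bytes per element

# K and V each hold n_ctx * n_kv_heads * head_dim elements per layer.
kv_bytes=$((2 * N_LAYERS * N_CTX * N_KV_HEADS * HEAD_DIM * BYTES_F16))
echo "f16 KV cache at ${N_CTX} ctx: $((kv_bytes / 1073741824)) GiB"
# → f16 KV cache at 131072 ctx: 24 GiB
```

Under these assumptions the full-context f16 cache alone costs 24 GiB; quantizing the cache (e.g. to q8_0) roughly halves that footprint, which is the lever behind the Pareto sweep described above.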

## [Key Experimental Data and Reproducibility]

### Core Data
- Phase0: ROCm + MMQ reached 406 tok/s prefill and 40.1 tok/s decode; chat-workload throughput improved 47% over the Vulkan backend
- Mission01: f16/f16 KV precision sustains a 131K-token context, with prefill at 152.76 tok/s
- Mission34: Successfully loaded the 129GB MiniMax-M2.5 model (22.1GB on the RTX 3090, 109.5GB on the Radeon 8060S)

### Reproducibility Design
- Environment variable-driven configuration
- Task-level detailed documentation
- Raw data (JSON/CSV) made public
- Runnable test scripts open-sourced (MIT license)
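
A minimal sketch of what environment-variable-driven configuration can look like; the variable names and defaults here are illustrative placeholders, not the project's actual script.

```shell
#!/bin/sh
# Illustrative env-driven benchmark wrapper: every knob has a default and
# can be overridden per run, so a result is reproducible from the recorded
# environment alone. Names and defaults are placeholders.
: "${MODEL_PATH:=models/example-q4_k_m.gguf}"
: "${N_GPU_LAYERS:=99}"
: "${UBATCH:=2048}"
: "${N_PARALLEL:=8}"

echo "model=${MODEL_PATH} ngl=${N_GPU_LAYERS} ub=${UBATCH} np=${N_PARALLEL}"
# exec ./build/bin/llama-bench -m "$MODEL_PATH" -ngl "$N_GPU_LAYERS" -ub "$UBATCH"
```

A run is then reproduced by exporting the same variables, e.g. `UBATCH=1024 ./bench.sh`.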

## [Research Conclusions and Industry Significance]

### Core Conclusions
1. Desktop integrated GPU platforms can handle serious large model inference; 128GB unified memory supports models with over 100B parameters
2. Heterogeneous RPC inference validates the feasibility of cross-vendor GPU collaboration
3. Submitted fixes and optimization suggestions upstream to llama.cpp

### Industry Significance
- Promotes AI democratization: Reduces local inference costs and supports privacy-sensitive/offline scenarios
- Demonstrates heterogeneous computing: Provides new ideas for ultra-large-scale model inference
- Open-source contributions: Publishes data and scripts to support community development

## [Limitations and Future Optimization Directions]

### Current Limitations
1. Wi-Fi RPC introduces latency; wired connections may improve performance
2. ROCm ecosystem maturity lags behind CUDA
3. Sustained high load remains a challenge for thermal management

### Future Directions
1. Extend testing to newer models such as Llama 3 and Qwen3
2. Explore new GGUF quantization schemes
3. Try multi-node RPC clusters
4. Develop a dedicated deployment toolchain for Strix Halo
