# hipfire: A Rust-native LLM Inference Engine for AMD RDNA GPUs

> hipfire is an LLM inference engine optimized specifically for AMD RDNA architecture GPUs. Written in Rust, it needs no Python runtime and no link-time dependency on ROCm, and it reports 1.23x to 1.35x the generation speed of llama.cpp on a consumer RX 5700 XT.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-29T23:43:28.000Z
- Last activity: 2026-03-29T23:57:31.210Z
- Popularity: 163.8
- Keywords: AMD, RDNA, GPU, Rust, LLM, inference, quantization, Qwen, DeltaNet, open source
- Page link: https://www.zingnex.cn/en/forum/thread/hipfire-amd-rdna-gpurustllm
- Canonical: https://www.zingnex.cn/forum/thread/hipfire-amd-rdna-gpurustllm

---

## Project Background and Motivation

NVIDIA's CUDA ecosystem has long dominated AI inference, while AMD GPU users often face incomplete toolchains and poorly optimized kernels. hipfire targets this gap: it is an LLM inference engine designed from scratch for AMD RDNA GPUs, written in Rust, with no Python runtime and no link-time dependency on ROCm.

The core philosophy of hipfire is "RDNA-native": kernels are tuned for the hardware characteristics of AMD GPUs rather than ported from CUDA solutions. As the benchmarks later in this post show, this yields 1.23x to 1.35x the generation speed of llama.cpp on a consumer RX 5700 XT.

## 1. Pure Rust Implementation and Zero-Dependency Design

hipfire is written entirely in Rust and loads `libamdhip64.so` at runtime via `dlopen`, so the build does not link against ROCm. The HIP runtime library must still be present on the machine that runs the engine. This design offers several advantages:

- **Simplified Deployment**: No need to configure complex ROCm development environments
- **Compact Size**: No Python interpreter or heavy dependencies like PyTorch
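- **Fast Startup**: No interpreter or framework import at launch, which shortens cold start
- **Memory Safety**: Rust's ownership system rules out use-after-free and most out-of-bounds bugs in safe code; the `unsafe` FFI calls into HIP and the GPU kernels themselves remain outside these guarantees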
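The runtime-loading approach can be sketched in a few lines. This is a minimal illustration, not hipfire's actual loader: the helper names `load_symbol` and `hip_device_count` are hypothetical, the `RTLD_NOW` value is the Linux/glibc one, and the `unsafe extern` syntax assumes Rust 1.82 or newer. `hipGetDeviceCount` is a real HIP runtime function and is used here only as an example symbol.

```rust
use std::ffi::{c_char, c_int, c_void, CStr, CString};

// POSIX dynamic-loading API, provided by the C library at link time.
unsafe extern "C" {
    fn dlopen(filename: *const c_char, flags: c_int) -> *mut c_void;
    fn dlsym(handle: *mut c_void, symbol: *const c_char) -> *mut c_void;
    fn dlerror() -> *mut c_char;
}

const RTLD_NOW: c_int = 2; // Linux/glibc value (assumption: Linux target)

/// C signature of `hipGetDeviceCount(int* count)`, which returns a status code.
type HipGetDeviceCount = unsafe extern "C" fn(*mut c_int) -> c_int;

/// Read the most recent dlopen/dlsym error message.
fn last_error() -> String {
    unsafe {
        let msg = dlerror();
        if msg.is_null() {
            "unknown dynamic loading error".to_string()
        } else {
            CStr::from_ptr(msg).to_string_lossy().into_owned()
        }
    }
}

/// Open `lib` and resolve `symbol`. The handle is intentionally never closed,
/// so the library stays mapped for the lifetime of the process.
fn load_symbol(lib: &str, symbol: &str) -> Result<*mut c_void, String> {
    let lib_c = CString::new(lib).map_err(|e| e.to_string())?;
    let sym_c = CString::new(symbol).map_err(|e| e.to_string())?;
    unsafe {
        let handle = dlopen(lib_c.as_ptr(), RTLD_NOW);
        if handle.is_null() {
            return Err(last_error());
        }
        let ptr = dlsym(handle, sym_c.as_ptr());
        if ptr.is_null() {
            return Err(last_error());
        }
        Ok(ptr)
    }
}

/// Ask the HIP runtime how many devices it sees, if the library is installed.
fn hip_device_count() -> Result<i32, String> {
    let ptr = load_symbol("libamdhip64.so", "hipGetDeviceCount")?;
    // SAFETY: relies on the symbol having the C signature declared above.
    let f: HipGetDeviceCount = unsafe { std::mem::transmute(ptr) };
    let mut count: c_int = 0;
    let status = unsafe { f(&mut count) };
    if status == 0 {
        Ok(count)
    } else {
        Err(format!("hipGetDeviceCount returned status {status}"))
    }
}
```

Because the library is resolved at runtime, a binary built this way compiles on a machine without ROCm and reports a clean error, rather than failing to start, when `libamdhip64.so` is missing.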

## 2. HFQ Quantization Format and GEMV Optimization

hipfire introduces its own HFQ (HipFire Quantized) quantization format, designed to keep register pressure low on the RDNA architecture:

- **HFQ4 Format**: Each 256-weight block requires only 136 bytes of storage (f32 scaling factor + f32 zero point + 128 bytes of packed data)
- **Low Register Usage**: The GEMV kernel uses only 18 VGPRs, less than half of the 39 VGPRs used by llama.cpp's Q4_K kernel
- **Higher Concurrency**: Lower register pressure allows more concurrent wavefronts and better memory latency hiding
- **Measured Bandwidth**: Effective bandwidth reaches 282 GB/s, about 34% higher than llama.cpp's ~210 GB/s
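The HFQ4 block layout described above works out to 136 × 8 / 256 = 4.25 bits per weight. A CPU-side sketch of that layout follows. The field order, the nibble order (low nibble first), and the dequantization formula `w = scale * q + zero` are assumptions for illustration; the post does not specify them, and the real kernel runs on the GPU.

```rust
/// One HFQ4 block: 256 weights in 136 bytes (4 + 4 + 128).
#[repr(C)]
struct Hfq4Block {
    scale: f32,        // f32 scaling factor, 4 bytes
    zero: f32,         // f32 zero point, 4 bytes
    packed: [u8; 128], // 256 four-bit codes, two per byte
}

const WEIGHTS_PER_BLOCK: usize = 256;

/// Dequantize weight `i` of a block.
/// Assumed convention: even indices use the low nibble, odd indices the high one.
fn dequant(block: &Hfq4Block, i: usize) -> f32 {
    let byte = block.packed[i / 2];
    let q = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
    block.scale * q as f32 + block.zero
}

/// Reference GEMV row: dot product of one quantized weight row with an f32 vector.
fn dot_row(blocks: &[Hfq4Block], x: &[f32]) -> f32 {
    assert_eq!(x.len(), blocks.len() * WEIGHTS_PER_BLOCK);
    let mut acc = 0.0f32;
    for (b, block) in blocks.iter().enumerate() {
        for i in 0..WEIGHTS_PER_BLOCK {
            acc += dequant(block, i) * x[b * WEIGHTS_PER_BLOCK + i];
        }
    }
    acc
}
```

With a single scale and zero point per 256 weights, a kernel needs to hold very little per-block metadata, which is one plausible reason the format keeps the VGPR count low.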

## 3. TurboQuant KV Cache Compression

The KV cache is the main memory bottleneck for long-context inference. hipfire's TurboQuant technology compresses it aggressively using the FWHT (Fast Walsh-Hadamard Transform):

| Configuration | Compression Ratio | Generation Speed | Output Quality |
|------|--------|----------|----------|
| Q8 (default) | 3.88x | 59.9 tok/s | Good |
| turbo4 (4-bit) | 7.5x | 54.5 tok/s | Good |
| turbo3 (3-bit) | 9.85x | 52.0 tok/s | Good |
| turbo2 (2-bit) | 14.2x | 55.1 tok/s | Good |
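The compression ratios in the table are consistent with an FP32 baseline, 128-element vectors, and 4 bytes of per-vector metadata (for example one f32 scale or norm). These three parameters are inferred from the numbers, not stated in the post, so treat the sketch below as a sanity check rather than a description of the actual storage format.

```rust
/// Compression ratio versus FP32 for one `dim`-element vector stored at
/// `bits` bits per element plus `overhead_bytes` of per-vector metadata.
fn compression_ratio(dim: usize, bits: usize, overhead_bytes: usize) -> f64 {
    let fp32_bytes = dim * 4;
    let packed_bytes = dim * bits / 8 + overhead_bytes;
    fp32_bytes as f64 / packed_bytes as f64
}
```

Under these assumptions, `compression_ratio(128, 2, 4)` is 512 / 36, which rounds to the 14.2x listed for turbo2, and the 8-bit, 4-bit, and 3-bit cases round to the other three table entries.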

The core innovation of TurboQuant is **norm-corrected quantization**:
- Normalize each KV vector to unit L2 norm
- Perform FWHT rotation via register-level `__shfl_xor` operations (no shared-memory barriers)
- Quantize to optimal centroids using the Lloyd-Max algorithm
- Store the ratio of original norm to reconstructed norm for correction

This design preserves each vector's L2 norm (up to floating-point rounding) and decorrelates quantization errors, which is what allows even 2-bit compression to maintain semantic coherence.
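The four steps can be sketched on the CPU as follows. This is an illustration under stated assumptions, not hipfire's kernel: it runs in plain Rust instead of using `__shfl_xor`, stores codes unpacked, and uses the commonly tabulated 2-bit Lloyd-Max levels for a unit-variance Gaussian, scaled by `sqrt(n)` on the assumption that rotated coordinates of a unit vector have variance close to `1/n`. All function and type names are hypothetical.

```rust
/// Lloyd-Max reconstruction levels for a unit-variance Gaussian at 2 bits.
const CENTROIDS_2BIT: [f32; 4] = [-1.5104, -0.4528, 0.4528, 1.5104];

/// In-place orthonormal fast Walsh-Hadamard transform (length must be a power of two).
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let s = 1.0 / (n as f32).sqrt();
    x.iter_mut().for_each(|v| *v *= s);
}

fn l2_norm(x: &[f32]) -> f32 {
    x.iter().map(|v| v * v).sum::<f32>().sqrt()
}

/// Index of the centroid closest to `x`.
fn nearest(x: f32) -> u8 {
    let mut best = 0;
    for (i, c) in CENTROIDS_2BIT.iter().enumerate() {
        if (x - c).abs() < (x - CENTROIDS_2BIT[best]).abs() {
            best = i;
        }
    }
    best as u8
}

struct Quantized {
    codes: Vec<u8>,  // one 2-bit code per element, stored unpacked for clarity
    norm_ratio: f32, // original norm divided by reconstructed norm
}

fn quantize(v: &[f32]) -> Quantized {
    let scale = (v.len() as f32).sqrt();
    let norm = l2_norm(v);
    let inv = if norm > 0.0 { 1.0 / norm } else { 0.0 };
    // Step 1: normalize to unit L2 norm.
    let mut u: Vec<f32> = v.iter().map(|x| x * inv).collect();
    // Step 2: rotate with the FWHT.
    fwht(&mut u);
    // Step 3: snap each rotated coordinate to its nearest centroid.
    let codes: Vec<u8> = u.iter().map(|&x| nearest(x * scale)).collect();
    // Step 4: store the ratio of the original norm to the reconstructed norm.
    let recon: Vec<f32> = codes.iter().map(|&c| CENTROIDS_2BIT[c as usize] / scale).collect();
    Quantized { codes, norm_ratio: norm / l2_norm(&recon) }
}

fn dequantize(q: &Quantized) -> Vec<f32> {
    let scale = (q.codes.len() as f32).sqrt();
    let mut u: Vec<f32> = q
        .codes
        .iter()
        .map(|&c| CENTROIDS_2BIT[c as usize] / scale * q.norm_ratio)
        .collect();
    fwht(&mut u); // the orthonormal FWHT is its own inverse
    u
}
```

Because the rotation is orthonormal, rescaling by `norm_ratio` makes the reconstructed vector's L2 norm match the original up to rounding, which is the norm-correction property the text describes.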

## 4. Qwen3.5 DeltaNet Support

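According to the project, hipfire is the first engine to implement inference support for the Qwen3.5 series DeltaNet models, covering the 0.8B, 2B, 4B, and 9B parameter versions. DeltaNet uses a gated linear attention mechanism that keeps a fixed-size recurrent state. A 128x128 FP32 state matrix takes 128 × 128 × 4 = 65,536 bytes, so it maps precisely into the 64 KB LDS of RDNA1. This achieves: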

- **190 tok/s** generation speed (Qwen3.5-0.8B)
- Support for Q8 and FP32 state quantization
- Efficient updates of the recurrent state S
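For readers unfamiliar with the recurrent state update, here is a CPU sketch of one step of the gated delta rule as published for Gated DeltaNet, S ← α·S·(I − β·k·kᵀ) + β·v·kᵀ. Whether hipfire's kernel uses exactly this form, and the scalar gates `alpha` and `beta` per step, are assumptions. The function names are hypothetical.

```rust
const D: usize = 128;
/// Bytes for one FP32 D x D state: 128 * 128 * 4 = 65,536, i.e. 64 KB.
const STATE_BYTES: usize = D * D * std::mem::size_of::<f32>();

/// One recurrent step for a single head. `s` is a row-major d x d matrix,
/// `alpha` is the decay gate and `beta` the write strength.
fn delta_step(s: &mut [f32], k: &[f32], v: &[f32], alpha: f32, beta: f32) {
    let d = k.len();
    assert_eq!(s.len(), d * d);
    assert_eq!(v.len(), d);
    for i in 0..d {
        let row = &mut s[i * d..(i + 1) * d];
        // What the current state predicts for value component i: (S k)_i.
        let pred: f32 = row.iter().zip(k).map(|(a, b)| a * b).sum();
        // Algebraically equal to alpha*S*(I - beta*k*k^T) + beta*v*k^T, row by row.
        let delta = beta * (v[i] - alpha * pred);
        for (sij, kj) in row.iter_mut().zip(k) {
            *sij = alpha * *sij + delta * kj;
        }
    }
}

/// Read from the state with a query: o = S q.
fn read(s: &[f32], q: &[f32]) -> Vec<f32> {
    s.chunks(q.len())
        .map(|row| row.iter().zip(q).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}
```

The state has the same size at every step regardless of context length, which is why it can stay resident in on-chip LDS instead of growing like a KV cache.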

## Performance Benchmarks

Measurements on an AMD RX 5700 XT (gfx1010, RDNA1, 8 GB GDDR6, released in 2019, ~$200):

### Text Generation Speed (tok/s)

| Model | hipfire | llama.cpp | Speedup |
|------|---------|-----------|--------|
| Qwen3-8B | 59.9 | 44.3 | 1.35x |
| Qwen3-8B (long context) | 52.7 | 42.8 | 1.23x |
| Qwen3-0.6B | 262 | 193.6 | 1.35x |
| Qwen3.5-0.8B DeltaNet | 190 | N/A | - |
