Zing Forum

hipfire: A Rust-native LLM Inference Engine for AMD RDNA GPUs

hipfire is an LLM inference engine optimized specifically for AMD RDNA-architecture GPUs. Written in Rust, it depends on neither a Python runtime nor ROCm linking at build time, and it achieves faster generation than llama.cpp on consumer GPUs such as the RX 5700 XT.

Tags: AMD, RDNA, GPU, Rust, LLM inference, quantization, Qwen, DeltaNet, open source
Published 2026-03-30 07:43 · Recent activity 2026-03-30 07:57 · Estimated read: 6 min


Section 02

Project Background and Motivation

In the AI inference field, NVIDIA's CUDA ecosystem has long dominated, while AMD GPU users often face incomplete toolchains and thin performance optimization. hipfire fills this gap: it is an LLM inference engine designed from scratch for AMD RDNA-architecture GPUs, written in Rust, with no dependency on a Python runtime or on ROCm linking. The core philosophy of hipfire is "RDNA-native": deep optimization for the hardware characteristics of AMD GPUs, rather than a straightforward port of CUDA-oriented designs. This design philosophy lets it reach strong inference performance even on consumer GPUs.


Section 03

1. Pure Rust Implementation and Zero-Dependency Design

hipfire uses a pure Rust codebase, dynamically loading libamdhip64.so at runtime via dlopen, eliminating the need for ROCm linking during compilation. This design offers multiple advantages:

  • Simplified Deployment: No need to configure complex ROCm development environments
  • Compact Size: No Python interpreter or heavy dependencies like PyTorch
  • Fast Startup: Significantly reduced cold start time
  • Memory Safety: Rust's ownership system rules out use-after-free bugs and segmentation faults in safe code
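The runtime-loading pattern behind this design can be sketched in a few lines. Since libamdhip64.so only exists on machines with the AMD driver stack installed, this minimal stand-in loads libm and resolves `cos` instead; swapping in "libamdhip64.so" and a HIP entry point such as `hipGetDeviceCount` uses the exact same mechanism. The raw `dlopen`/`dlsym` declarations assume a glibc Linux target (where these symbols live in libc itself); hipfire's actual binding code is not shown here.

```rust
use std::ffi::CString;
use std::os::raw::{c_char, c_double, c_int, c_void};

// On glibc >= 2.34, dlopen/dlsym are provided by libc, so no extra link flags.
extern "C" {
    fn dlopen(filename: *const c_char, flag: c_int) -> *mut c_void;
    fn dlsym(handle: *mut c_void, symbol: *const c_char) -> *mut c_void;
}

const RTLD_NOW: c_int = 2;

fn main() {
    unsafe {
        // Stand-in for loading "libamdhip64.so" at run time: the build
        // machine needs no ROCm installation at all.
        let lib = CString::new("libm.so.6").unwrap();
        let handle = dlopen(lib.as_ptr(), RTLD_NOW);
        assert!(!handle.is_null(), "dlopen failed");

        // Stand-in for resolving a HIP entry point like hipGetDeviceCount.
        let name = CString::new("cos").unwrap();
        let sym = dlsym(handle, name.as_ptr());
        assert!(!sym.is_null(), "dlsym failed");

        // Cast the untyped symbol to its known C signature before calling.
        let cos: unsafe extern "C" fn(c_double) -> c_double = std::mem::transmute(sym);
        println!("cos(0.0) = {}", cos(0.0));
    }
}
```

The key point is that the library name is a runtime string, not a link-time dependency, which is what frees users from configuring a ROCm development environment.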

Section 04

2. HFQ Quantization Format and GEMV Optimization

hipfire introduces the proprietary HFQ (HipFire Quantized) quantization format, optimized for the register pressure of RDNA architecture:

  • HFQ4 Format: Each 256-weight block requires only 136 bytes of storage (f32 scaling factor + f32 zero point + 128 bytes of packed data)
  • Low Register Usage: The GEMV kernel uses only 18 VGPRs, fewer than half the 39 VGPRs used by llama.cpp's Q4_K kernel
  • Higher Concurrency: Lower register pressure means more concurrent wavefronts and better memory latency hiding
  • Measured Bandwidth: Effective bandwidth reaches 282 GB/s, far exceeding llama.cpp's ~210 GB/s
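For concreteness, the 136-byte figure can be verified with a small layout sketch. The field names below are illustrative, not hipfire's actual definitions; the arithmetic is what matters.

```rust
// Back-of-envelope check of the HFQ4 block layout described above:
// one f32 scale + one f32 zero point + 256 packed 4-bit weights.
const WEIGHTS_PER_BLOCK: usize = 256;

#[repr(C)]
struct Hfq4Block {
    scale: f32,        // per-block scaling factor (4 bytes)
    zero: f32,         // per-block zero point (4 bytes)
    packed: [u8; 128], // 256 4-bit weights, two per byte (128 bytes)
}

fn main() {
    let block_bytes = std::mem::size_of::<Hfq4Block>();
    assert_eq!(block_bytes, 136); // 4 + 4 + 128, no padding needed

    // Effective storage cost: 136 * 8 / 256 = 4.25 bits per weight.
    let bits_per_weight = block_bytes as f64 * 8.0 / WEIGHTS_PER_BLOCK as f64;
    println!("{} bytes/block, {:.2} bits/weight", block_bytes, bits_per_weight);
}
```

At 4.25 bits per weight of metadata overhead, HFQ4 is slightly leaner than formats that carry per-sub-block scales, which is part of what keeps its GEMV kernel's register footprint small.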

Section 05

3. TurboQuant KV Cache Compression

The KV cache is the memory bottleneck for long-context inference. hipfire's TurboQuant technique achieves aggressive compression via the Fast Walsh-Hadamard Transform (FWHT):

Configuration     Compression Ratio   Generation Speed   Output Quality
Q8 (default)      3.88x               59.9 tok/s         Good
turbo4 (4-bit)    7.5x                54.5 tok/s         Good
turbo3 (3-bit)    9.85x               52.0 tok/s         Good
turbo2 (2-bit)    14.2x               55.1 tok/s         Good

The core innovation of TurboQuant is norm-corrected quantization:

  • Normalize each KV vector to unit L2 norm
  • Perform FWHT rotation via register-level __shfl_xor operations (zero shared memory barriers)
  • Quantize to optimal centroids using the Lloyd-Max algorithm
  • Store the ratio of original norm to reconstructed norm for correction

This design ensures precise L2 norm preservation and decorrelated quantization errors, allowing 2-bit compression to maintain semantic coherence.
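The four steps above can be sketched on the CPU. This is a toy illustration, not hipfire's kernel: a plain loop stands in for the `__shfl_xor` butterflies, and a uniform 2-bit grid stands in for the Lloyd-Max optimal centroids. The property being demonstrated, exact L2 norm preservation through the norm-correction factor, carries over to the real implementation.

```rust
// In-place orthonormal FWHT; the butterfly with stride h mirrors what
// __shfl_xor with mask h does across a wavefront on the GPU.
fn fwht(v: &mut [f32]) {
    let n = v.len(); // must be a power of two
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = (n as f32).sqrt().recip(); // keep the rotation norm-preserving
    for x in v.iter_mut() {
        *x *= scale;
    }
}

fn main() {
    let mut v = vec![3.0_f32, -1.0, 0.5, 2.0];
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();

    // 1) normalize to unit L2 norm
    for x in v.iter_mut() {
        *x /= norm;
    }
    // 2) rotate: the FWHT decorrelates the coordinates
    fwht(&mut v);
    // 3) quantize to a toy 2-bit grid {-1.0, -0.5, 0.0, 0.5}
    //    (the real kernel uses Lloyd-Max optimal centroids instead)
    let q: Vec<f32> = v
        .iter()
        .map(|x| (x * 2.0).round().clamp(-2.0, 1.0) / 2.0)
        .collect();
    // 4) store the ratio of original to reconstructed norm for correction
    let recon_norm = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    let correction = if recon_norm > 0.0 { 1.0 / recon_norm } else { 0.0 };

    // Decode: rescale by the correction factor and the stored norm.
    // (A full decode would also apply the inverse FWHT; it is norm-preserving,
    // so the check below is unaffected.)
    let deq: Vec<f32> = q.iter().map(|x| x * correction * norm).collect();
    let deq_norm = deq.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((deq_norm - norm).abs() < 1e-3); // L2 norm preserved
    println!("original norm {:.4}, reconstructed norm {:.4}", norm, deq_norm);
}
```

Because the correction factor rescales the quantized vector back to the original norm, even heavy 2-bit quantization perturbs only the direction of each KV vector, never its magnitude, which is what the table above reflects.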


Section 06

4. Qwen3.5 DeltaNet Support

hipfire is the first to implement inference support for the Qwen3.5 series DeltaNet models, covering the 0.8B/2B/4B/9B parameter versions. DeltaNet uses a gated linear attention mechanism; hipfire maps its 128x128 state matrix exactly into RDNA1's 64 KB LDS (128 × 128 × 4-byte f32 = 64 KB), achieving:

  • 190 tok/s generation speed (Qwen3.5-0.8B)
  • Support for Q8 and FP32 state quantization
  • Efficient update of recursive S states
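The recurrent S-state update can be illustrated with a toy delta-rule sketch (dimensions shrunk from 128 to 4). The gating here follows the generic gated delta rule, S ← α·S + β·k(v − Sᵀk)ᵀ; Qwen3.5's exact parameterization may differ, and this is not hipfire's kernel, which keeps the full d×d state resident in LDS.

```rust
// Toy gated delta-rule update for a linear-attention state matrix S.
// Each token performs a rank-1 "corrective write": the value delta
// (v - S^T k) is written under key k, after decaying the old state.
const D: usize = 4;

fn delta_update(s: &mut [[f32; D]; D], k: [f32; D], v: [f32; D], alpha: f32, beta: f32) {
    // Prediction currently stored under key k: pred = S^T k
    let mut pred = [0.0f32; D];
    for i in 0..D {
        for j in 0..D {
            pred[j] += s[i][j] * k[i];
        }
    }
    // Decay old state, then add the rank-1 correction.
    for i in 0..D {
        for j in 0..D {
            s[i][j] = alpha * s[i][j] + beta * k[i] * (v[j] - pred[j]);
        }
    }
}

fn main() {
    let mut s = [[0.0f32; D]; D];
    let k = [1.0, 0.0, 0.0, 0.0]; // unit key
    let v = [0.5, -1.0, 2.0, 0.25];

    delta_update(&mut s, k, v, 1.0, 1.0);

    // Reading back with the same key recovers v exactly.
    let mut out = [0.0f32; D];
    for i in 0..D {
        for j in 0..D {
            out[j] += s[i][j] * k[i];
        }
    }
    assert_eq!(out, v);

    // A second identical write is a no-op: the delta is already zero.
    let s_before = s;
    delta_update(&mut s, k, v, 1.0, 1.0);
    assert_eq!(s, s_before);
    println!("delta-rule state update ok");
}
```

The corrective form is what distinguishes DeltaNet from plain linear attention (which would accumulate k·vᵀ unconditionally): writing the same key/value pair twice changes nothing, so the fixed-size state is used economically.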

Section 07

Performance Benchmarks

Measured data on AMD RX 5700 XT (gfx1010, RDNA1, 8GB GDDR6, released in 2019, ~$200):


Section 08

Text Generation Speed (tok/s)

Model                   hipfire   llama.cpp   Speedup
Qwen3-8B                59.9      44.3        1.35x
Qwen3-8B (long text)    52.7      42.8        1.23x
Qwen3-0.6B              262       193.6       1.35x
Qwen3.5-0.8B DeltaNet   190       N/A         -