Zing Forum

hipfire: A Rust-native LLM Inference Engine for AMD RDNA GPUs

hipfire is an LLM inference engine optimized specifically for AMD RDNA-architecture GPUs. Written in Rust, it depends on neither a Python runtime nor ROCm linking at build time, and it achieves faster generation than llama.cpp on consumer GPUs such as the RX 5700 XT.

Tags: AMD, RDNA, GPU, Rust, LLM inference, quantization, Qwen, DeltaNet, open source
Published 2026-03-30 07:43 · Recent activity 2026-03-30 07:57 · Estimated read: 6 min


Section 02

Project Background and Motivation

In the AI inference field, NVIDIA's CUDA ecosystem has long dominated, while AMD GPU users often face incomplete toolchains and thin performance optimization. hipfire fills this gap: it is an LLM inference engine designed from scratch for AMD RDNA-architecture GPUs, written in Rust, with no dependency on a Python runtime or on ROCm linking. The core philosophy of hipfire is "RDNA-native": deep optimization for the hardware characteristics of AMD GPUs, rather than a straightforward port of CUDA-oriented designs. This design philosophy lets it reach strong inference performance even on consumer GPUs.


Section 03

1. Pure Rust Implementation and Zero-Dependency Design

hipfire uses a pure Rust codebase, dynamically loading libamdhip64.so at runtime via dlopen, eliminating the need for ROCm linking during compilation. This design offers multiple advantages:

  • Simplified Deployment: No need to configure complex ROCm development environments
  • Compact Size: No Python interpreter or heavy dependencies like PyTorch
  • Fast Startup: Significantly reduced cold start time
  • Memory Safety: Rust's ownership system rules out use-after-free bugs and segmentation faults in safe code
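The runtime-loading pattern behind this design can be sketched in a few lines. Since libamdhip64.so only exists on machines with the AMD driver stack installed, this minimal stand-in loads libm and resolves `cos` instead; swapping in "libamdhip64.so" and a HIP entry point such as `hipGetDeviceCount` uses the exact same mechanism. The raw `dlopen`/`dlsym` declarations assume a glibc Linux target (where these symbols live in libc itself); hipfire's actual binding code is not shown here.

```rust
use std::ffi::CString;
use std::os::raw::{c_char, c_double, c_int, c_void};

// On glibc >= 2.34, dlopen/dlsym are provided by libc, so no extra link flags.
extern "C" {
    fn dlopen(filename: *const c_char, flag: c_int) -> *mut c_void;
    fn dlsym(handle: *mut c_void, symbol: *const c_char) -> *mut c_void;
}

const RTLD_NOW: c_int = 2;

fn main() {
    unsafe {
        // Stand-in for loading "libamdhip64.so" at run time: the build
        // machine needs no ROCm installation at all.
        let lib = CString::new("libm.so.6").unwrap();
        let handle = dlopen(lib.as_ptr(), RTLD_NOW);
        assert!(!handle.is_null(), "dlopen failed");

        // Stand-in for resolving a HIP entry point like hipGetDeviceCount.
        let name = CString::new("cos").unwrap();
        let sym = dlsym(handle, name.as_ptr());
        assert!(!sym.is_null(), "dlsym failed");

        // Cast the untyped symbol to its known C signature before calling.
        let cos: unsafe extern "C" fn(c_double) -> c_double = std::mem::transmute(sym);
        println!("cos(0.0) = {}", cos(0.0));
    }
}
```

The key point is that the library name is a runtime string, not a link-time dependency, which is what frees users from configuring a ROCm development environment.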

Section 04

2. HFQ Quantization Format and GEMV Optimization

hipfire introduces the proprietary HFQ (HipFire Quantized) quantization format, optimized for the register pressure of RDNA architecture:

  • HFQ4 Format: Each 256-weight block requires only 136 bytes of storage (f32 scaling factor + f32 zero point + 128 bytes of packed data)
  • Low Register Usage: The GEMV kernel uses only 18 VGPRs, fewer than half the 39 VGPRs used by llama.cpp's Q4_K kernel
  • Higher Concurrency: Lower register pressure means more concurrent wavefronts and better memory latency hiding
  • Measured Bandwidth: Effective bandwidth reaches 282 GB/s, far exceeding llama.cpp's ~210 GB/s
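For concreteness, the 136-byte figure can be verified with a small layout sketch. The field names below are illustrative, not hipfire's actual definitions; the arithmetic is what matters.

```rust
// Back-of-envelope check of the HFQ4 block layout described above:
// one f32 scale + one f32 zero point + 256 packed 4-bit weights.
const WEIGHTS_PER_BLOCK: usize = 256;

#[repr(C)]
struct Hfq4Block {
    scale: f32,        // per-block scaling factor (4 bytes)
    zero: f32,         // per-block zero point (4 bytes)
    packed: [u8; 128], // 256 4-bit weights, two per byte (128 bytes)
}

fn main() {
    let block_bytes = std::mem::size_of::<Hfq4Block>();
    assert_eq!(block_bytes, 136); // 4 + 4 + 128, no padding needed

    // Effective storage cost: 136 * 8 / 256 = 4.25 bits per weight.
    let bits_per_weight = block_bytes as f64 * 8.0 / WEIGHTS_PER_BLOCK as f64;
    println!("{} bytes/block, {:.2} bits/weight", block_bytes, bits_per_weight);
}
```

At 4.25 bits per weight of metadata overhead, HFQ4 is slightly leaner than formats that carry per-sub-block scales, which is part of what keeps its GEMV kernel's register footprint small.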

Section 05

3. TurboQuant KV Cache Compression

The KV cache is the memory bottleneck for long-context inference. hipfire's TurboQuant technique achieves aggressive compression via the Fast Walsh-Hadamard Transform (FWHT):

Configuration     Compression Ratio   Generation Speed   Output Quality
Q8 (default)      3.88x               59.9 tok/s         Good
turbo4 (4-bit)    7.5x                54.5 tok/s         Good
turbo3 (3-bit)    9.85x               52.0 tok/s         Good
turbo2 (2-bit)    14.2x               55.1 tok/s         Good

The core innovation of TurboQuant is norm-corrected quantization:

  • Normalize each KV vector to unit L2 norm
  • Perform FWHT rotation via register-level __shfl_xor operations (zero shared memory barriers)
  • Quantize to optimal centroids using the Lloyd-Max algorithm
  • Store the ratio of original norm to reconstructed norm for correction

This design ensures precise L2 norm preservation and decorrelated quantization errors, allowing 2-bit compression to maintain semantic coherence.
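The four steps above can be sketched on the CPU. This is a toy illustration, not hipfire's kernel: a plain loop stands in for the `__shfl_xor` butterflies, and a uniform 2-bit grid stands in for the Lloyd-Max optimal centroids. The property being demonstrated, exact L2 norm preservation through the norm-correction factor, carries over to the real implementation.

```rust
// In-place orthonormal FWHT; the butterfly with stride h mirrors what
// __shfl_xor with mask h does across a wavefront on the GPU.
fn fwht(v: &mut [f32]) {
    let n = v.len(); // must be a power of two
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = (n as f32).sqrt().recip(); // keep the rotation norm-preserving
    for x in v.iter_mut() {
        *x *= scale;
    }
}

fn main() {
    let mut v = vec![3.0_f32, -1.0, 0.5, 2.0];
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();

    // 1) normalize to unit L2 norm
    for x in v.iter_mut() {
        *x /= norm;
    }
    // 2) rotate: the FWHT decorrelates the coordinates
    fwht(&mut v);
    // 3) quantize to a toy 2-bit grid {-1.0, -0.5, 0.0, 0.5}
    //    (the real kernel uses Lloyd-Max optimal centroids instead)
    let q: Vec<f32> = v
        .iter()
        .map(|x| (x * 2.0).round().clamp(-2.0, 1.0) / 2.0)
        .collect();
    // 4) store the ratio of original to reconstructed norm for correction
    let recon_norm = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    let correction = if recon_norm > 0.0 { 1.0 / recon_norm } else { 0.0 };

    // Decode: rescale by the correction factor and the stored norm.
    // (A full decode would also apply the inverse FWHT; it is norm-preserving,
    // so the check below is unaffected.)
    let deq: Vec<f32> = q.iter().map(|x| x * correction * norm).collect();
    let deq_norm = deq.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((deq_norm - norm).abs() < 1e-3); // L2 norm preserved
    println!("original norm {:.4}, reconstructed norm {:.4}", norm, deq_norm);
}
```

Because the correction factor rescales the quantized vector back to the original norm, even heavy 2-bit quantization perturbs only the direction of each KV vector, never its magnitude, which is what the table above reflects.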


Section 06

4. Qwen3.5 DeltaNet Support

hipfire is the first to implement inference support for the Qwen3.5 series DeltaNet models, covering the 0.8B/2B/4B/9B parameter versions. DeltaNet uses a gated linear attention mechanism; hipfire maps its 128x128 state matrix exactly into RDNA1's 64 KB LDS (128 × 128 × 4-byte f32 = 64 KB), achieving:

  • 190 tok/s generation speed (Qwen3.5-0.8B)
  • Support for Q8 and FP32 state quantization
  • Efficient update of recursive S states
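The recurrent S-state update can be illustrated with a toy delta-rule sketch (dimensions shrunk from 128 to 4). The gating here follows the generic gated delta rule, S ← α·S + β·k(v − Sᵀk)ᵀ; Qwen3.5's exact parameterization may differ, and this is not hipfire's kernel, which keeps the full d×d state resident in LDS.

```rust
// Toy gated delta-rule update for a linear-attention state matrix S.
// Each token performs a rank-1 "corrective write": the value delta
// (v - S^T k) is written under key k, after decaying the old state.
const D: usize = 4;

fn delta_update(s: &mut [[f32; D]; D], k: [f32; D], v: [f32; D], alpha: f32, beta: f32) {
    // Prediction currently stored under key k: pred = S^T k
    let mut pred = [0.0f32; D];
    for i in 0..D {
        for j in 0..D {
            pred[j] += s[i][j] * k[i];
        }
    }
    // Decay old state, then add the rank-1 correction.
    for i in 0..D {
        for j in 0..D {
            s[i][j] = alpha * s[i][j] + beta * k[i] * (v[j] - pred[j]);
        }
    }
}

fn main() {
    let mut s = [[0.0f32; D]; D];
    let k = [1.0, 0.0, 0.0, 0.0]; // unit key
    let v = [0.5, -1.0, 2.0, 0.25];

    delta_update(&mut s, k, v, 1.0, 1.0);

    // Reading back with the same key recovers v exactly.
    let mut out = [0.0f32; D];
    for i in 0..D {
        for j in 0..D {
            out[j] += s[i][j] * k[i];
        }
    }
    assert_eq!(out, v);

    // A second identical write is a no-op: the delta is already zero.
    let s_before = s;
    delta_update(&mut s, k, v, 1.0, 1.0);
    assert_eq!(s, s_before);
    println!("delta-rule state update ok");
}
```

The corrective form is what distinguishes DeltaNet from plain linear attention (which would accumulate k·vᵀ unconditionally): writing the same key/value pair twice changes nothing, so the fixed-size state is used economically.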

Section 07

Performance Benchmarks

Measured data on AMD RX 5700 XT (gfx1010, RDNA1, 8GB GDDR6, released in 2019, ~$200):


Section 08

Text Generation Speed (tok/s)

Model                   hipfire   llama.cpp   Speedup
Qwen3-8B                59.9      44.3        1.35x
Qwen3-8B (long text)    52.7      42.8        1.23x
Qwen3-0.6B              262       193.6       1.35x
Qwen3.5-0.8B DeltaNet   190       N/A         -