Team Red LLM: A Practical Guide and Benchmark Database for Local LLM Inference on AMD GPUs

A community-maintained guide to local LLM deployment on AMD GPUs, covering step-by-step ROCm/HIP inference setup, common pitfalls, and real performance benchmark data, with support for consumer Radeon cards, data center Instinct accelerators, and Strix Halo APUs.

Tags: AMD GPU · ROCm · LLM inference · Radeon · local AI · benchmark · open source
Published 2026-05-02 · Estimated read: 5 min

Section 01

Team Red LLM: AMD GPU Local LLM Inference Guide & Benchmark Database

Team Red LLM is a community-maintained project focused on local LLM deployment on AMD GPUs. It provides detailed ROCm/HIP inference steps, documents common pitfalls, collects real performance benchmarks, and covers consumer Radeon cards, data center Instinct accelerators, and Strix Halo APUs. The project aims to help AMD users avoid ROCm-related pitfalls and improve their local AI experience.


Section 02

Project Background: Addressing CUDA Centralization for AMD Users

The current LLM open-source ecosystem is CUDA-centric: most tutorials assume NVIDIA GPUs, toolchains often silently fall back to CPU instead of using ROCm, and working solutions are scattered across old Reddit posts. AMD users face a 'second-class citizen' experience. Team Red LLM was created as a community cookbook and benchmark database to address these issues, giving developers a place to share ROCm pitfalls and help others avoid them.


Section 03

Core Content: Modular Structure for Easy Access

The project uses a modular structure:

  • COOKBOOK.md: Step-by-step deployment guide with verified instructions and clearly marked pitfalls.
  • benchmarks/results.csv: Community-contributed token generation speed data (not vendor marketing).
  • hardware/: Per-GPU docs (BIOS, drivers, cooling, best models).
  • models/: Model-specific configs (launch params, problematic quant formats, MoE offload tips).
  • scripts/: Wrapper scripts for llama-server, model switching, and benchmarking (a minimal wrapper sketch follows this list).
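
As an illustration of what such a launch wrapper might do, here is a minimal Python sketch. The MODELS registry, paths, and defaults are all hypothetical, not taken from the repository; `-m`, `-ngl`, and `--port` are standard llama.cpp llama-server flags:

```python
#!/usr/bin/env python3
"""Minimal llama-server launch wrapper (illustrative sketch only)."""
import subprocess
import sys

# Hypothetical model registry: short name -> GGUF path. Adapt to your files.
MODELS = {
    "moonlight": "/models/Moonlight-16B-A3B-Instruct-Q6_K.gguf",
}

def launch(name: str, port: int = 8080) -> None:
    if name not in MODELS:
        sys.exit(f"unknown model {name!r}; known: {', '.join(MODELS)}")
    cmd = [
        "llama-server",
        "-m", MODELS[name],   # model file
        "-ngl", "99",         # offload all layers to the GPU
        "--port", str(port),
    ]
    print("launching:", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    launch(sys.argv[1] if len(sys.argv) > 1 else "moonlight")
```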

Section 04

Benchmark Evidence: RX 7900 GRE Performance

The project provides performance data for the RX 7900 GRE 16 GB:

| GPU | Architecture | Model | Quantization | Mode | Generation tok/s | Prompt tok/s |
| --- | --- | --- | --- | --- | --- | --- |
| RX 7900 GRE | gfx1100 | Moonlight-16B-A3B-Instruct | Q6_K | Full GPU | 100.2 | 188.1 |
| RX 7900 GRE | gfx1100 | gemma-4-26B-A4B-it | UD-Q4_K_M | MoE offload (-ncmoe 6) | 31.0 | 61.3 |
| RX 7900 GRE | gfx1100 | Qwen3.6-35B-A3B-UD | Q4_K_S | MoE offload (-ncmoe 32) | 22.7 | 41.7 |
| RX 7900 GRE | gfx1100 | gemma-4-26B-A4B-it | UD-Q6_K | MoE offload (-ncmoe 16) | 17.3 | 80.7 |

Two modes are compared: Full GPU (maximum speed, limited by VRAM capacity) and MoE offload (runs larger models by keeping expert weights in system RAM, but generation is roughly 3-5x slower because it is bound by DDR5 memory bandwidth).
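
To make the trade-off concrete, here is a rough sketch of the decision the guide implies. The heuristic and the 1.5 GB overhead allowance are hypothetical placeholders, not figures from the project:

```python
# Rough heuristic for choosing between the two modes (illustrative only):
# run fully on the GPU when the quantized model fits in VRAM,
# fall back to MoE offload otherwise.
VRAM_GB = 16.0       # RX 7900 GRE
OVERHEAD_GB = 1.5    # hypothetical allowance for KV cache and buffers

def pick_mode(model_size_gb: float) -> str:
    if model_size_gb + OVERHEAD_GB <= VRAM_GB:
        return "Full GPU (max speed)"
    return "MoE offload (fits larger models, ~3-5x slower generation)"

print(pick_mode(13.0))  # a ~13 GB Q6_K file -> "Full GPU (max speed)"
print(pick_mode(20.0))  # exceeds 16 GB VRAM -> MoE offload
```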


Section 05

AMD GPU & ROCm Support Matrix

The project maintains a support matrix:

| Code | Architecture family | Example models | ROCm support status |
| --- | --- | --- | --- |
| gfx1100 | RDNA3 | RX 7900 XTX/XT/GRE | ✅ Mature |
| gfx1101 | RDNA3 | RX 7800 XT / RX 7700 XT | ✅ Usable |
| gfx1102 | RDNA3 | RX 7600 | ⚠️ Partial support |
| gfx1200/1201 | RDNA4 | RX 9070 XT / RX 9060 | ✅ Recently supported |
| gfx1150/1151 | Strix Halo | Ryzen AI Max+ 395 | ⚠️ Bleeding edge |
| gfx942/950 | CDNA3 | MI300X/MI325X | ✅ Data center |

Strix Halo APUs currently require manual patches or specific kernel versions for ROCm support.
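
For partially supported targets, a workaround widely used in the community (and worth cross-checking against the COOKBOOK for your specific card) is ROCm's `HSA_OVERRIDE_GFX_VERSION` environment variable, which presents the card to the runtime as a better-supported gfx target. A minimal sketch, assuming a gfx1101/gfx1102 card spoofing gfx1100; the model path is hypothetical:

```python
import os
import subprocess

# Common ROCm workaround: present a partially supported RDNA3 card
# (gfx1101/gfx1102) to the runtime as the mature gfx1100 target.
# Verify against your card and ROCm version before relying on it.
env = dict(os.environ, HSA_OVERRIDE_GFX_VERSION="11.0.0")

# Launch llama-server under the overridden target.
subprocess.run(
    ["llama-server", "-m", "/models/model.gguf", "-ngl", "99"],
    env=env,
    check=True,
)
```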


Section 06

How to Contribute

Contribute via:

  • Benchmarks: Submit a PR against benchmarks/results.csv or use the GitHub Issue template (a row-validation sketch follows below).
  • Pitfalls: Add to COOKBOOK.md.
  • Models: Create model-specific docs in models/.
  • Hardware: Add GPU docs in hardware/.

Use GitHub Discussions for general topics (hardware suggestions, 'is it worth it' questions) and Issues for bugs.
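
Before opening a benchmark PR, it helps to sanity-check the new row against the CSV schema. In this small sketch the column names are inferred from the Section 04 table, not taken from the actual results.csv header:

```python
# Column names inferred from the Section 04 table; check the actual
# header of benchmarks/results.csv before submitting a PR.
EXPECTED = ["gpu", "arch", "model", "quant", "mode", "gen_tok_s", "prompt_tok_s"]

def check_row(row: dict) -> list[str]:
    """Return a list of problems found in a candidate benchmark row."""
    problems = [f"missing column: {c}" for c in EXPECTED if c not in row]
    for col in ("gen_tok_s", "prompt_tok_s"):
        try:
            float(row.get(col, ""))
        except ValueError:
            problems.append(f"{col} is not a number: {row.get(col)!r}")
    return problems

row = {
    "gpu": "RX 7900 GRE", "arch": "gfx1100",
    "model": "Moonlight-16B-A3B-Instruct", "quant": "Q6_K",
    "mode": "Full GPU", "gen_tok_s": "100.2", "prompt_tok_s": "188.1",
}
print(check_row(row) or "row looks valid")
```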


Section 07

Significance for Local AI Ecosystem

Team Red LLM shows that AMD hardware can handle local LLM inference when the community pools its knowledge. It lowers the entry barrier for budget-conscious users (e.g., those buying used RX 6000/7000 series cards) and reduces AMD users' reliance on cloud APIs. The project is MIT-licensed, encouraging free use and modification.