viiwork: The Load Balancing Tool That Turns Old AMD GPUs Into LLM Inference Clusters

viiwork is an LLM inference load balancer designed specifically for old GPUs like the AMD Radeon VII. It can cluster multiple GPUs with 16GB HBM2 memory, provide an OpenAI-compatible API interface, and breathe new life into legacy hardware.

Tags: viiwork, AMD Radeon VII, LLM inference load balancing, ROCm, llama.cpp, GPU cluster, open source, Mesh cluster
Published 2026-04-06 01:43 · Recent activity 2026-04-06 01:51 · Estimated read: 5 min

Section 01

[Introduction] viiwork: The Load Balancing Tool That Turns Old AMD GPUs Into LLM Inference Clusters

viiwork is an open-source LLM inference load balancer designed for old gfx906 architecture GPUs like the AMD Radeon VII. It can form a Mesh cluster with multiple GPUs equipped with 16GB HBM2 memory and provide an OpenAI-compatible API interface. It taps into the potential of legacy hardware, offering a low-cost inference solution for users on a budget. Key features include intelligent model recommendation, real-time electricity cost monitoring, and Pipeline chained inference, among others.
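Because the balancer exposes an OpenAI-compatible API, any standard client can talk to it. A minimal sketch of building a chat-completions request, assuming a hypothetical node address and model name (the URL, port, and model identifier below are illustrative, not confirmed by the project):

```python
import json

# Hypothetical viiwork endpoint; an OpenAI-compatible balancer typically
# serves the standard /v1/chat/completions route.
VIIWORK_URL = "http://localhost:8080/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Return an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

payload = build_request("qwen2.5-coder-14b", "Write hello world in C.")
print(json.dumps(payload, indent=2))
# POST this payload to VIIWORK_URL with any HTTP client or the official
# OpenAI SDK pointed at the balancer's base URL.
```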


Section 02

Project Background: 50 GPUs in the Mother-in-Law's Garage and the Discovery of Old Hardware Value

viiwork was born from the author's desire to put 50 Radeon VII GPUs sitting in his mother-in-law's garage to use. Although gfx906-architecture GPUs such as the Radeon VII and Instinct MI50/MI60 are old, they come with 16GB/32GB of HBM2 memory and 1TB/s of memory bandwidth. Since LLM inference is typically bottlenecked by memory bandwidth rather than compute, these old GPUs remain well suited to inference workloads.
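The bandwidth argument can be made concrete with back-of-the-envelope arithmetic: during token-by-token decoding, every weight must stream through memory once per generated token, so bandwidth divided by model size gives a rough throughput ceiling (a simplification that ignores KV-cache traffic and compute overlap):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    # Each decoded token reads all weights once, so the memory bus
    # bounds throughput at roughly bandwidth / model size.
    return bandwidth_gb_s / model_size_gb

# Radeon VII: 1000 GB/s bandwidth, a 13 GB quantized model
print(round(decode_ceiling_tok_s(1000, 13), 1), "tok/s ceiling")
```

Even at a fraction of this theoretical ceiling, a single 2017-era card is fast enough for interactive use, which is the whole premise of the project.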


Section 03

Core Architecture and Features: Mesh Cluster and Multi-Scenario Support

viiwork supports single-machine multi-model deployment (e.g., 10 GPUs assigned to different model ports). The more powerful Mesh cluster mode lets multiple nodes form an elastic cluster that routes requests automatically and skips nodes that are down. Additionally, the Pipeline feature can chain multiple LLM steps into a single virtual model, and the MCP server integrates with AI assistants to expose local inference as tools.
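The routing behavior described above can be sketched as round-robin selection that skips unhealthy nodes. This is a minimal illustration of the idea, not viiwork's actual implementation; the node names and health-tracking scheme are made up:

```python
from itertools import cycle

class MeshRouter:
    """Round-robin over nodes, skipping any marked down (illustrative)."""

    def __init__(self, nodes: dict):
        self.nodes = nodes            # {node_name: is_healthy}
        self._ring = cycle(nodes)     # endless iteration over node names

    def pick(self) -> str:
        # Try at most one full lap; if nothing is healthy, fail loudly.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if self.nodes[node]:
                return node
        raise RuntimeError("no healthy nodes in the mesh")

router = MeshRouter({"garage-01": True, "garage-02": False, "garage-03": True})
picks = [router.pick() for _ in range(4)]
print(picks)  # down node garage-02 is never selected
```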


Section 04

Practical Features: Intelligent Recommendation and Cost Monitoring

viiwork's setup-node.sh script includes an "I'm Feeling Lucky" mode: enter a category code and it automatically recommends models suited to the hardware. It also integrates Nord Pool spot electricity price tracking; after an ENTSO-E API key is configured, it monitors each node's electricity cost in real time (hourly cost, daily accumulation, etc.), which helps with cost management in large-scale deployments.
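The underlying cost arithmetic is straightforward: Nord Pool spot prices are quoted per MWh, so a node's hourly cost is its power draw converted to MWh times the spot price. A sketch with illustrative numbers (the 300 W draw and 50 EUR/MWh price are assumptions, not measured figures):

```python
def hourly_cost_eur(power_watts: float, spot_eur_per_mwh: float) -> float:
    # watts -> MWh consumed in one hour, then multiply by the spot price
    return power_watts / 1_000_000 * spot_eur_per_mwh

# A ~300 W Radeon VII node at a 50 EUR/MWh spot price:
cost = hourly_cost_eur(300, 50)
print(f"{cost:.4f} EUR/hour, {cost * 24:.2f} EUR/day")
```

At garage scale (tens of GPUs) this per-node figure multiplies quickly, which is presumably why the author wired price tracking into the tool at all.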


Section 05

Recommended Models and Quantization Strategies: Optimal Choices for 16GB Memory

viiwork is optimized for the 16GB Radeon VII, and all recommended models are kept within the 13GB safe memory limit:

  • Programming models: Qwen2.5-Coder-14B (Q6_K), Devstral-Small-24B (Q3_K_M), etc.;
  • Text generation: Qwen3-32B (UD-Q2_K_XL), Gemma-3-27B-IT (Q3_K_S), etc.;
  • Gemma4 series: Gemma-4-26B-A4B-IT (UD-Q3_K_M) (MoE architecture), Gemma4-E4B-IT (Q8_0), etc.
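Whether a quantized model fits the 13GB limit can be estimated from parameter count times bits per weight. A rough sizing sketch (the ~6.6 bits/weight figure for Q6_K is an approximate average, and real GGUF files carry extra overhead this ignores):

```python
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # Approximate GGUF file size: parameters * bits/weight / 8 bits-per-byte.
    # Ignores embedding/metadata overhead, so treat as a lower-bound estimate.
    return params_billions * bits_per_weight / 8

SAFE_LIMIT_GB = 13.0  # viiwork's safety margin on a 16 GB card

# A 14B model at ~6.6 bits/weight (roughly Q6_K):
size = quant_size_gb(14, 6.6)
print(f"~{size:.1f} GB, fits: {size <= SAFE_LIMIT_GB}")
```

This is why the list pairs larger models with more aggressive quantization: a 32B model only squeezes under 13GB at around 3 bits/weight or less.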

Section 06

Technical Details and Deployment: ROCm Compatibility and Docker Simplification

The viiwork Docker image pins the llama.cpp version and patches the HIP FP8 header file for gfx906 (working around a header-file issue introduced in ROCm 6.2+). Deployment requirements are minimal: a Linux system with the amdgpu driver, Docker with access to the GPU device nodes (/dev/kfd, /dev/dri), and no ROCm installation on the host.
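Since the host only needs to expose those two device nodes, a simple preflight check can verify GPU passthrough will work before launching the container. A sketch (the check itself is my illustration, not part of viiwork; the `--device` flags are the standard way to hand ROCm devices to Docker):

```python
import os

# The amdgpu driver exposes these nodes on the host; ROCm itself
# lives inside the viiwork image, so nothing else is required.
REQUIRED_DEVICES = ["/dev/kfd", "/dev/dri"]

def missing_devices(paths=REQUIRED_DEVICES) -> list:
    """Return the device paths that are absent on this host."""
    return [p for p in paths if not os.path.exists(p)]

missing = missing_devices()
if missing:
    print("GPU passthrough unavailable, missing:", ", ".join(missing))
else:
    print("ok: run the container with --device=/dev/kfd --device=/dev/dri")
```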


Section 07

Performance and Summary: Open-Source Innovation Breathes New Life Into Old Hardware

viiwork includes benchmark tools like bench.sh (stress test) and bench-sustained.sh (sustained load test). By tapping into the potential of old AMD GPUs, the project provides a feasible inference solution for users on a budget. Features like Mesh clusters and intelligent recommendations reflect pragmatism, promote AI democratization, and demonstrate the innovative capabilities of the open-source community.
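In the spirit of bench-sustained.sh, a sustained-load measurement boils down to firing requests for a fixed window and reporting aggregate tokens per second. A toy sketch with a stubbed request function (the real scripts presumably hit the balancer over HTTP; `send_request` and its numbers are placeholders):

```python
import time

def send_request() -> int:
    """Stand-in for a real inference call; returns tokens generated."""
    time.sleep(0.01)   # pretend inference latency
    return 32          # pretend token count per response

def sustained_tok_s(duration_s: float = 0.1) -> float:
    # Loop for the measurement window, summing tokens, then divide
    # by the actual elapsed time to get aggregate throughput.
    start, tokens = time.monotonic(), 0
    while time.monotonic() - start < duration_s:
        tokens += send_request()
    return tokens / (time.monotonic() - start)

print(f"{sustained_tok_s():.0f} tok/s sustained")
```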