viiwork: The Load Balancing Tool That Turns Old AMD GPUs Into LLM Inference Clusters

viiwork is an LLM inference load balancer designed specifically for old GPUs like the AMD Radeon VII. It can cluster multiple GPUs with 16GB HBM2 memory, provide an OpenAI-compatible API interface, and breathe new life into legacy hardware.

Tags: viiwork, AMD Radeon VII, LLM inference load balancing, ROCm, llama.cpp, GPU cluster, open source, Mesh cluster
Published 2026-04-06 01:43 · Recent activity 2026-04-06 01:51 · Estimated read: 5 min

Section 01

[Introduction] viiwork: The Load Balancing Tool That Turns Old AMD GPUs Into LLM Inference Clusters

viiwork is an open-source LLM inference load balancer designed for old gfx906 architecture GPUs like the AMD Radeon VII. It can form a Mesh cluster with multiple GPUs equipped with 16GB HBM2 memory and provide an OpenAI-compatible API interface. It taps into the potential of legacy hardware, offering a low-cost inference solution for users on a budget. Key features include intelligent model recommendation, real-time electricity cost monitoring, and Pipeline chained inference, among others.
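Because the balancer exposes an OpenAI-compatible API, any standard client can talk to it. A minimal sketch of building a chat-completions request, assuming a hypothetical node address and model name (the URL, port, and model identifier below are illustrative, not confirmed by the project):

```python
import json

# Hypothetical viiwork endpoint; an OpenAI-compatible balancer typically
# serves the standard /v1/chat/completions route.
VIIWORK_URL = "http://localhost:8080/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Return an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

payload = build_request("qwen2.5-coder-14b", "Write hello world in C.")
print(json.dumps(payload, indent=2))
# POST this payload to VIIWORK_URL with any HTTP client or the official
# OpenAI SDK pointed at the balancer's base URL.
```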


Section 02

Project Background: 50 GPUs in the Mother-in-Law's Garage and the Discovery of Old Hardware Value

viiwork was born from the author's desire to put 50 Radeon VII GPUs sitting in his mother-in-law's garage to use. Although gfx906-architecture GPUs such as the Radeon VII and Instinct MI50/MI60 are old, they come with 16GB/32GB of HBM2 memory and 1TB/s of memory bandwidth. Since LLM inference is typically bottlenecked by memory bandwidth rather than compute, these old GPUs remain well suited to inference workloads.
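The bandwidth argument can be made concrete with back-of-the-envelope arithmetic: during token-by-token decoding, every weight must stream through memory once per generated token, so bandwidth divided by model size gives a rough throughput ceiling (a simplification that ignores KV-cache traffic and compute overlap):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    # Each decoded token reads all weights once, so the memory bus
    # bounds throughput at roughly bandwidth / model size.
    return bandwidth_gb_s / model_size_gb

# Radeon VII: 1000 GB/s bandwidth, a 13 GB quantized model
print(round(decode_ceiling_tok_s(1000, 13), 1), "tok/s ceiling")
```

Even at a fraction of this theoretical ceiling, a single 2017-era card is fast enough for interactive use, which is the whole premise of the project.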


Section 03

Core Architecture and Features: Mesh Cluster and Multi-Scenario Support

viiwork supports single-machine multi-model deployment (e.g., 10 GPUs assigned to different model ports). The more powerful Mesh cluster mode lets multiple nodes form an elastic cluster that routes requests automatically and skips nodes that are down. Additionally, the Pipeline feature can chain multiple LLM steps into a single virtual model, and the MCP server integrates with AI assistants to expose local inference as tools.
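The routing behavior described above can be sketched as round-robin selection that skips unhealthy nodes. This is a minimal illustration of the idea, not viiwork's actual implementation; the node names and health-tracking scheme are made up:

```python
from itertools import cycle

class MeshRouter:
    """Round-robin over nodes, skipping any marked down (illustrative)."""

    def __init__(self, nodes: dict):
        self.nodes = nodes            # {node_name: is_healthy}
        self._ring = cycle(nodes)     # endless iteration over node names

    def pick(self) -> str:
        # Try at most one full lap; if nothing is healthy, fail loudly.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if self.nodes[node]:
                return node
        raise RuntimeError("no healthy nodes in the mesh")

router = MeshRouter({"garage-01": True, "garage-02": False, "garage-03": True})
picks = [router.pick() for _ in range(4)]
print(picks)  # down node garage-02 is never selected
```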


Section 04

Practical Features: Intelligent Recommendation and Cost Monitoring

viiwork's setup-node.sh script includes an "I'm Feeling Lucky" mode: enter a category code and it automatically recommends models suited to the hardware. It also integrates Nord Pool spot electricity price tracking; after an ENTSO-E API key is configured, it monitors each node's electricity cost in real time (hourly cost, daily accumulation, etc.), which helps with cost management in large-scale deployments.
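The underlying cost arithmetic is straightforward: Nord Pool spot prices are quoted per MWh, so a node's hourly cost is its power draw converted to MWh times the spot price. A sketch with illustrative numbers (the 300 W draw and 50 EUR/MWh price are assumptions, not measured figures):

```python
def hourly_cost_eur(power_watts: float, spot_eur_per_mwh: float) -> float:
    # watts -> MWh consumed in one hour, then multiply by the spot price
    return power_watts / 1_000_000 * spot_eur_per_mwh

# A ~300 W Radeon VII node at a 50 EUR/MWh spot price:
cost = hourly_cost_eur(300, 50)
print(f"{cost:.4f} EUR/hour, {cost * 24:.2f} EUR/day")
```

At garage scale (tens of GPUs) this per-node figure multiplies quickly, which is presumably why the author wired price tracking into the tool at all.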


Section 05

Recommended Models and Quantization Strategies: Optimal Choices for 16GB Memory

viiwork is optimized for the 16GB Radeon VII, and all recommended models are kept within the 13GB safe memory limit:

  • Programming models: Qwen2.5-Coder-14B (Q6_K), Devstral-Small-24B (Q3_K_M), etc.;
  • Text generation: Qwen3-32B (UD-Q2_K_XL), Gemma-3-27B-IT (Q3_K_S), etc.;
  • Gemma4 series: Gemma-4-26B-A4B-IT (UD-Q3_K_M) (MoE architecture), Gemma4-E4B-IT (Q8_0), etc.
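Whether a quantized model fits the 13GB limit can be estimated from parameter count times bits per weight. A rough sizing sketch (the ~6.6 bits/weight figure for Q6_K is an approximate average, and real GGUF files carry extra overhead this ignores):

```python
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # Approximate GGUF file size: parameters * bits/weight / 8 bits-per-byte.
    # Ignores embedding/metadata overhead, so treat as a lower-bound estimate.
    return params_billions * bits_per_weight / 8

SAFE_LIMIT_GB = 13.0  # viiwork's safety margin on a 16 GB card

# A 14B model at ~6.6 bits/weight (roughly Q6_K):
size = quant_size_gb(14, 6.6)
print(f"~{size:.1f} GB, fits: {size <= SAFE_LIMIT_GB}")
```

This is why the list pairs larger models with more aggressive quantization: a 32B model only squeezes under 13GB at around 3 bits/weight or less.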

Section 06

Technical Details and Deployment: ROCm Compatibility and Docker Simplification

The viiwork Docker image pins the llama.cpp version and patches the HIP FP8 header file for gfx906 (working around a header-file issue introduced in ROCm 6.2+). Deployment requirements are minimal: a Linux system with the amdgpu driver, Docker with access to the GPU device nodes (/dev/kfd, /dev/dri), and no ROCm installation on the host.
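Since the host only needs to expose those two device nodes, a simple preflight check can verify GPU passthrough will work before launching the container. A sketch (the check itself is my illustration, not part of viiwork; the `--device` flags are the standard way to hand ROCm devices to Docker):

```python
import os

# The amdgpu driver exposes these nodes on the host; ROCm itself
# lives inside the viiwork image, so nothing else is required.
REQUIRED_DEVICES = ["/dev/kfd", "/dev/dri"]

def missing_devices(paths=REQUIRED_DEVICES) -> list:
    """Return the device paths that are absent on this host."""
    return [p for p in paths if not os.path.exists(p)]

missing = missing_devices()
if missing:
    print("GPU passthrough unavailable, missing:", ", ".join(missing))
else:
    print("ok: run the container with --device=/dev/kfd --device=/dev/dri")
```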


Section 07

Performance and Summary: Open-Source Innovation Breathes New Life Into Old Hardware

viiwork includes benchmark tools like bench.sh (stress test) and bench-sustained.sh (sustained load test). By tapping into the potential of old AMD GPUs, the project provides a feasible inference solution for users on a budget. Features like Mesh clusters and intelligent recommendations reflect pragmatism, promote AI democratization, and demonstrate the innovative capabilities of the open-source community.
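In the spirit of bench-sustained.sh, a sustained-load measurement boils down to firing requests for a fixed window and reporting aggregate tokens per second. A toy sketch with a stubbed request function (the real scripts presumably hit the balancer over HTTP; `send_request` and its numbers are placeholders):

```python
import time

def send_request() -> int:
    """Stand-in for a real inference call; returns tokens generated."""
    time.sleep(0.01)   # pretend inference latency
    return 32          # pretend token count per response

def sustained_tok_s(duration_s: float = 0.1) -> float:
    # Loop for the measurement window, summing tokens, then divide
    # by the actual elapsed time to get aggregate throughput.
    start, tokens = time.monotonic(), 0
    while time.monotonic() - start < duration_s:
        tokens += send_request()
    return tokens / (time.monotonic() - start)

print(f"{sustained_tok_s():.0f} tok/s sustained")
```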