Zing Forum

LLM-Toolkit: A Practical Guide to Maximizing Local Large Model Performance in Hybrid GPU Environments

A local LLM inference toolkit for AMD APU + NVIDIA discrete GPU hybrid environments, enabling flexible dual-GPU scheduling via the Vulkan backend and resolving ROCm compatibility issues on older architectures.

Tags: LLM local deployment · Vulkan · AMD · NVIDIA · ROCm · llama.cpp · GPU acceleration · hybrid GPUs · Linux
Published 2026-04-12 07:39 · Recent activity 2026-04-12 07:52 · Estimated read: 6 min

Section 01

Introduction / Main Floor: LLM-Toolkit: A Practical Guide to Maximizing Local Large Model Performance in Hybrid GPU Environments

A local LLM inference toolkit for AMD APU + NVIDIA discrete GPU hybrid environments, enabling flexible dual-GPU scheduling via the Vulkan backend and resolving ROCm compatibility issues on older architectures.


Section 02

Practical Challenges in Hybrid GPU Environments

For users who want to run large language models locally, hardware configuration often involves compromises. Many people have desktops or laptops equipped with AMD APUs and NVIDIA discrete GPUs. This heterogeneous environment faces numerous challenges when running LLMs on Linux: ROCm has limited support for older AMD GPUs, Vulkan is universal but complex to configure, and there is little documentation on how to make the two work together.

The LLM-Toolkit project was created to address this specific scenario. It is not a general LLM deployment solution, but a deeply optimized toolkit for the specific hardware combination of AMD Ryzen APU + NVIDIA discrete GPU.


Section 03

Project Background and Hardware Configuration

The project author's actual hardware environment is quite representative:

  • CPU/APU: AMD Ryzen 7 5700G (8 cores, 16 threads, Zen 3 architecture)
  • Integrated GPU: Radeon Vega 8 (GCN 5 architecture, 8 CUs, shared memory)
  • Discrete GPU: NVIDIA GeForce RTX 5090 (32GB VRAM)
  • Memory: 48GB DDR4 (shared with Vega 8)
  • Operating System: Ubuntu 25.10, kernel 6.17

The challenge with this configuration: the Vega 8 is a GCN-architecture part (gfx900), while ROCm officially supports only RDNA2 (gfx1030) and newer architectures. This means the ROCm/HIP backend cannot be used to accelerate inference on the APU.
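Whether ROCm will even claim a given GPU can be checked before installing anything. A minimal sketch of that check follows; the support-list patterns below are illustrative rather than exhaustive, and on a real system the gfx target would come from `rocminfo`:

```shell
#!/bin/sh
# Hypothetical helper: report whether a gfx target appears on ROCm's
# official support list (RDNA2 / gfx1030 and newer). Patterns are
# illustrative, not AMD's authoritative list.
rocm_supported() {
  case "$1" in
    gfx103*|gfx110*) echo "yes" ;;  # RDNA2 / RDNA3 consumer parts
    *)               echo "no"  ;;  # GCN targets such as gfx900 (Vega)
  esac
}

# On a real system: rocminfo | grep -o 'gfx[0-9a-f]*' | head -n1
rocm_supported gfx900    # Vega 8  -> no
rocm_supported gfx1030   # RDNA2   -> yes
```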


Section 04

Technical Solution: Vulkan as a Universal Bridge

After in-depth research and testing, the project author ultimately chose Vulkan as the unified backend, a decision based on the following key findings:


Section 05

Limitations of ROCm

In the Linux kernel 6.17 environment, ROCm/HIP has serious driver-level issues on the Vega 8. Both the ROCm 5.7 shipped with Ubuntu and ROCm 6.4.4 (tested via Docker) crash at the amdgpu driver level with a MODE2 GPU reset triggered by a no-retry page fault. This is a kernel driver bug and cannot be fixed from user space.


Section 06

Advantages of Vulkan

As a cross-platform graphics and compute API, Vulkan has far broader hardware support. The project's benchmarks show the performance differences across backends (using the Llama 2 7B Chat Q4_K_S model):

| Backend  | Device        | Prompt Processing Speed | Generation Speed |
|----------|---------------|-------------------------|------------------|
| Vulkan   | RTX 5090      | 2,117 tokens/s          | 273 tokens/s     |
| Vulkan   | Vega 8 iGPU   | 49 tokens/s             | 14 tokens/s      |
| CPU-only | Ryzen 7 5700G | 55 tokens/s             | 12 tokens/s      |

The data reveals several key insights:

  1. The RTX 5090's Vulkan performance is outstanding: at over 2,000 tokens/s of prompt processing, even long contexts are preprocessed almost instantly
  2. The Vega 8's Vulkan is usable: prompt processing at 49 tokens/s is roughly on par with the CPU, and generation is slightly faster (14 vs 12 tokens/s), which is enough for lightweight tasks
  3. CPU-only mode still has value: when GPU acceleration is unstable, pure CPU inference can be the more reliable choice
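The dual-GPU scheduling this enables comes down to device selection: llama.cpp's Vulkan backend honors the `GGML_VK_VISIBLE_DEVICES` environment variable, with indices as reported by `vulkaninfo --summary`. A hedged sketch; the model filename and index ordering here are assumptions, and the helper only echoes the command so it is safe to run anywhere:

```shell
#!/bin/sh
# Build (but do not execute) a llama-server launch command pinned to one
# Vulkan device. In this illustrative ordering, 0 = RTX 5090, 1 = Vega 8;
# verify the real order with `vulkaninfo --summary`.
build_llama_cmd() {
  dev="$1"; model="$2"
  # -ngl 99 offloads all model layers to the selected GPU
  echo "GGML_VK_VISIBLE_DEVICES=$dev llama-server -m $model -ngl 99"
}

build_llama_cmd 0 llama-2-7b-chat.Q4_K_S.gguf   # heavy work on the dGPU
build_llama_cmd 1 llama-2-7b-chat.Q4_K_S.gguf   # lightweight tasks on the iGPU
```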

Section 07

Toolkit Composition and Usage

LLM-Toolkit provides a series of carefully designed startup scripts covering different usage scenarios:


Section 08

Core Scripts

  • start-llm.sh: Main launcher, uses Vulkan backend and RTX 5090 by default, includes memory protection mechanisms
  • run-llamaserver-vulkan.sh: Directly calls the Vulkan wrapper for llama-server, supports full device selection
  • run-llamaserver-rocm.sh: Legacy ROCm/HIP wrapper, currently only used as an alternative for CPU-only mode
  • build-llamacpp-rocm-vega.sh: Script to build llama.cpp for the gfx900 target, applies HIP 5.7 compatibility patches
  • launch-lmstudio-vulkan.sh: Dedicated launcher to configure the Vulkan environment for LM Studio
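The project's scripts are not reproduced here, but the memory-protection idea behind `start-llm.sh` can be sketched: since the Vega 8 carves its VRAM out of shared system RAM, a launcher should check `MemAvailable` before loading a model. The threshold, flags, and device index below are assumptions, not the project's actual values:

```shell
#!/bin/sh
# Sketch of a start-llm.sh-style memory guard (illustrative, not the
# project's script): refuse to launch when available RAM is too low.
avail_mb() {
  # MemAvailable is reported in kB in /proc/meminfo; convert to MiB
  awk '/MemAvailable/ { print int($2 / 1024) }' /proc/meminfo
}

guarded_launch() {
  min="${MIN_MB:-8192}"   # assumed default threshold in MiB
  if [ "$(avail_mb)" -lt "$min" ]; then
    echo "refusing to start: less than ${min} MiB available" >&2
    return 1
  fi
  # Echo rather than exec so the sketch is safe to run anywhere
  echo "GGML_VK_VISIBLE_DEVICES=0 llama-server -m $1 -ngl 99"
}

MIN_MB=256 guarded_launch llama-2-7b-chat.Q4_K_S.gguf   # low threshold for demo
```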