# LLM-Toolkit: A Practical Guide to Maximizing Local Large Model Performance in Hybrid GPU Environments

> A local LLM inference toolkit for AMD APU + NVIDIA discrete GPU hybrid environments, enabling flexible dual-GPU scheduling via the Vulkan backend and resolving ROCm compatibility issues on older architectures.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-11T23:39:46.000Z
- Last activity: 2026-04-11T23:52:14.156Z
- Popularity: 163.8
- Keywords: LLM, local deployment, Vulkan, AMD, NVIDIA, ROCm, llama.cpp, GPU acceleration, hybrid graphics, Linux
- Page link: https://www.zingnex.cn/en/forum/thread/llm-toolkit-gpu
- Canonical: https://www.zingnex.cn/forum/thread/llm-toolkit-gpu

---


## Practical Challenges in Hybrid GPU Environments

For users who want to run large language models locally, hardware often involves compromises. Many desktops and laptops pair an AMD APU with an NVIDIA discrete GPU, and running LLMs on Linux in this heterogeneous setup is difficult: ROCm has limited support for older AMD GPUs, Vulkan is universal but complex to configure, and there is little documentation on making the two GPUs work together.

The LLM-Toolkit project was created to address this specific scenario. It is not a general LLM deployment solution, but a deeply optimized toolkit for the specific hardware combination of AMD Ryzen APU + NVIDIA discrete GPU.

## Project Background and Hardware Configuration

The project author's actual hardware environment is quite representative:

- **CPU/APU**: AMD Ryzen 7 5700G (8 cores, 16 threads, Zen 3 architecture)
- **Integrated GPU**: Radeon Vega 8 (GCN 5 architecture, 8 CUs, shared memory)
- **Discrete GPU**: NVIDIA GeForce RTX 5090 (32GB VRAM)
- **Memory**: 48GB DDR4 (shared with Vega 8)
- **Operating System**: Ubuntu 25.10, kernel 6.17

The difficulty with this configuration is that Vega 8 is a GCN-based GPU, while ROCm's official Radeon support centers on RDNA2 (gfx1030) and newer architectures and does not cover GCN-based APUs like this one. As a result, the ROCm/HIP backend cannot be relied on to accelerate inference on the APU.
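
On the memory side, a rough sanity check for what fits on each device is that a quantized model's weights take about parameters × bits per weight / 8 bytes. The minimal sketch below assumes a 7-billion-parameter model at about 4.5 bits per weight (an approximate figure for 4-bit K-quants, not a number from the project) and ignores the KV cache and runtime buffers, so real usage will be higher.

```python
# Rough weights-only footprint of a quantized model.
# Assumptions: ~4.5 bits per weight for a 4-bit K-quant (approximate);
# KV cache and runtime buffers are NOT included.

GIB = 1024 ** 3


def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def fits(n_params: float, bits_per_weight: float, budget_bytes: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weights_bytes(n_params, bits_per_weight) <= budget_bytes


if __name__ == "__main__":
    size = weights_bytes(7e9, 4.5)
    print(f"7B at 4.5 bits/weight: about {size / GIB:.2f} GiB")
    print("fits in 32 GiB:", fits(7e9, 4.5, 32 * GIB))
```

Under these assumptions the weights come to roughly 3.7 GiB, well inside the RTX 5090's 32GB. How much of the 48GB system memory the Vega 8 can actually use depends on how the system allocates shared memory, which the sketch does not model.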

## Technical Solution: Vulkan as a Universal Bridge

After testing the alternatives, the author settled on Vulkan as the unified backend, based on the following findings:

### Limitations of ROCm

On Linux kernel 6.17, ROCm/HIP has serious driver-level problems on Vega 8. Both the ROCm 5.7 packaged with Ubuntu and ROCm 6.4.4 tested via Docker crash inside the amdgpu driver: a no-retry page fault triggers a MODE2 GPU reset. This is a kernel driver bug that cannot be fixed from user space.
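
To check whether a machine is hitting the same failure, one option is to scan the kernel log for the two symptoms named above. This is a minimal sketch, not part of the toolkit; the substrings it matches are an assumption about how amdgpu words these messages, and the exact text can vary between kernel versions.

```python
import re
from typing import Iterable, List

# Assumption: these substrings approximate the amdgpu kernel messages for
# the crash described above; exact wording may differ by kernel version.
PATTERNS = [
    re.compile(r"no-retry page fault", re.IGNORECASE),
    re.compile(r"MODE2 reset", re.IGNORECASE),
]


def amdgpu_crash_lines(lines: Iterable[str]) -> List[str]:
    """Return amdgpu kernel-log lines that match any crash pattern."""
    hits = []
    for line in lines:
        if "amdgpu" not in line:
            continue  # only consider messages from the amdgpu driver
        if any(p.search(line) for p in PATTERNS):
            hits.append(line.rstrip("\n"))
    return hits
```

In practice the input would be the kernel messages for the current boot, for example the output of `journalctl -k -b` split into lines.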

### Advantages of Vulkan

Vulkan is a cross-platform graphics and compute API with broader hardware support than ROCm. The project's benchmarks compare the backends using the Llama 2 7B Chat model at Q4_K_S quantization:

| Backend | Device | Prompt processing (tokens/s) | Generation (tokens/s) |
|---------|--------|-----------------------------:|----------------------:|
| Vulkan | RTX 5090 | 2,117 | 273 |
| Vulkan | Vega 8 iGPU | 49 | 14 |
| CPU-only | Ryzen 7 5700G | 55 | 12 |
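
The two throughput columns combine into a rough per-request time: prompt tokens divided by prompt-processing speed, plus output tokens divided by generation speed. The sketch below plugs in the table's numbers for a hypothetical request of 2,000 prompt tokens and 200 output tokens; it ignores model load time and any slowdown at longer context lengths.

```python
# Throughput figures (tokens/s) copied from the benchmark table.
BENCH = {
    "Vulkan / RTX 5090": (2117, 273),
    "Vulkan / Vega 8 iGPU": (49, 14),
    "CPU-only / Ryzen 5700G": (55, 12),
}


def request_seconds(prompt_tps: float, gen_tps: float,
                    prompt_tokens: int, output_tokens: int) -> float:
    """Estimated wall time: prompt ingestion plus token-by-token generation."""
    return prompt_tokens / prompt_tps + output_tokens / gen_tps


if __name__ == "__main__":
    for name, (pp, tg) in BENCH.items():
        t = request_seconds(pp, tg, prompt_tokens=2000, output_tokens=200)
        print(f"{name:24s} {t:6.1f} s")
```

With these inputs the RTX 5090 finishes in under 2 seconds, while the Vega 8 and CPU-only paths both take around 53 to 55 seconds, only a couple of seconds apart.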

Three points stand out:

1. **The RTX 5090 is very fast under Vulkan**: at over 2,000 tokens/s of prompt processing, even long contexts are ingested within seconds (an 8,000-token prompt takes roughly 4 s).
2. **Vega 8 under Vulkan is usable but modest**: it performs roughly on par with the CPU (49 vs. 55 tokens/s for prompt processing, 14 vs. 12 tokens/s for generation), which is adequate for lightweight tasks.
3. **CPU-only mode still has value**: it is nearly as fast as the iGPU and, in some scenarios, more reliable than GPU acceleration on a faulty driver stack.
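
The third point suggests a reliability-first launch order: try the fastest device first and fall back when it is missing or misbehaving. The following is a hypothetical sketch of that policy, not the project's actual logic; the backend names are illustrative, and `probe` stands in for whatever health check a launcher would run (for example, a short test generation).

```python
from typing import Callable, Optional, Sequence

# Hypothetical preference order, fastest first per the benchmark table.
DEFAULT_ORDER = ("vulkan:rtx5090", "vulkan:vega8", "cpu")


def choose_backend(probe: Callable[[str], bool],
                   order: Sequence[str] = DEFAULT_ORDER) -> Optional[str]:
    """Return the first backend whose probe succeeds, or None if all fail.

    A probe that raises is treated as a failure, so a crashing driver
    moves selection down the list instead of aborting the launcher.
    """
    for backend in order:
        try:
            if probe(backend):
                return backend
        except Exception:
            continue  # treat any probe error as "backend unavailable"
    return None
```

For example, a probe that rejects `"vulkan:rtx5090"` (say, because the discrete GPU is busy) makes `choose_backend` return `"vulkan:vega8"`, and if both Vulkan probes fail it falls through to `"cpu"`.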

## Toolkit Composition and Usage

LLM-Toolkit provides a set of launcher scripts, each covering a different usage scenario:

### Core Scripts

- **start-llm.sh**: Main launcher; uses the Vulkan backend on the RTX 5090 by default and includes memory-protection safeguards
- **run-llamaserver-vulkan.sh**: Vulkan wrapper that calls llama-server directly, with full device selection
- **run-llamaserver-rocm.sh**: Legacy ROCm/HIP wrapper, now used only for running in CPU-only mode
- **build-llamacpp-rocm-vega.sh**: Builds llama.cpp for the gfx900 target and applies HIP 5.7 compatibility patches
- **launch-lmstudio-vulkan.sh**: Dedicated launcher that sets up the Vulkan environment for LM Studio
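
To make the device-selection and memory-protection ideas concrete, here is a hypothetical Python sketch of what a launcher like start-llm.sh might do; the project's scripts are shell, and their internals are not described here. It relies on the `GGML_VK_VISIBLE_DEVICES` environment variable, which restricts the Vulkan devices llama.cpp's Vulkan backend sees, and on the standard `llama-server` flags `-m`, `-ngl`, `--host` and `--port`. The device indices, memory budgets and 85% headroom are assumptions. Check the actual device order in the Vulkan device list llama.cpp prints at startup, or with `vulkaninfo --summary`.

```python
import os
from dataclasses import dataclass
from typing import Dict, List, Mapping, Optional, Tuple


@dataclass(frozen=True)
class VulkanDevice:
    index: int         # position in llama.cpp's Vulkan device list (assumed)
    name: str
    memory_bytes: int  # memory this device may use (assumed budget)


def check_memory(model_bytes: int, device: VulkanDevice,
                 headroom: float = 0.85) -> None:
    """Hypothetical guard: refuse to launch if the model file alone exceeds
    a fraction of the device's memory, leaving room for the KV cache."""
    budget = int(device.memory_bytes * headroom)
    if model_bytes > budget:
        raise MemoryError(
            f"{device.name}: model is {model_bytes} bytes, budget is {budget}")


def build_launch(model_path: str, model_bytes: int, device: VulkanDevice,
                 port: int = 8080,
                 base_env: Optional[Mapping[str, str]] = None
                 ) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) for a llama-server bound to one Vulkan device."""
    check_memory(model_bytes, device)
    env = dict(os.environ if base_env is None else base_env)
    # Hide every other Vulkan device from llama.cpp's Vulkan backend.
    env["GGML_VK_VISIBLE_DEVICES"] = str(device.index)
    argv = [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",  # offload up to 99 layers (all of them for a 7B model)
        "--host", "127.0.0.1",
        "--port", str(port),
    ]
    return env, argv
```

A caller would pass `os.path.getsize(model_path)` as `model_bytes` and then start the server with `subprocess.run(argv, env=env)`. Failing early with `MemoryError` keeps an oversized model from being loaded onto a device that cannot hold it.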