# ToshLLM: Enabling Local LLM Execution on Intel Macs with AMD GPUs

> ToshLLM is a local large language model (LLM) execution tool designed specifically for Intel Macs with AMD discrete GPUs. It uses Metal acceleration and dedicated AMD patches to fix the corrupted output and poor performance that conventional tools show on this hardware.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-12T15:12:08.000Z
- Last activity: 2026-06-12T15:20:03.358Z
- Popularity: 163.9
- Keywords: ToshLLM, local large language model, Intel Mac, AMD GPU, Metal acceleration, llama.cpp, MoE models, speculative decoding, local AI, open-source tools
- Page link: https://www.zingnex.cn/en/forum/thread/toshllm-intel-mac-amd
- Canonical: https://www.zingnex.cn/forum/thread/toshllm-intel-mac-amd
- Markdown source: floors_fallback

---

## Original Author and Source

- **Original Author/Maintainer**: Engelbert Delgado ([@engeldlgado](https://github.com/engeldlgado))
- **Source Platform**: GitHub
- **Original Title**: toshllm: Run large language models locally on Intel Macs with AMD GPUs
- **Original Link**: https://github.com/engeldlgado/toshllm
- **Publication Date**: June 12, 2026

---

## Background: The Overlooked Hardware Group

In recent years, local deployment tools for large language models (LLMs) have proliferated. Most of them, however, target Apple Silicon: the M1, M2, and M3 series, with their unified memory architecture and Neural Engine, are indeed well suited to running local LLMs.

One group of users has been overlooked, intentionally or not: those running Intel Macs with AMD discrete GPUs, including Hackintosh users. This hardware faces two critical issues with conventional local LLM tools:

1. **Corrupted Output**: Standard inference engines such as llama.cpp produce garbled output on AMD discrete GPUs.
2. **Poor Performance**: Model weights are transferred over PCIe far more slowly than the hardware allows, creating a severe bandwidth bottleneck.

ToshLLM was created to address exactly these two problems.

---

## Project Overview: Built Exclusively for Intel Mac + AMD GPU

ToshLLM is a native macOS SwiftUI application built on llama.cpp with dedicated patches for AMD GPUs. Its developer, Engelbert Delgado, built and tuned the tool on an Intel Mac equipped with a 12 GB RX 6700 XT.

## Core Performance Comparison

| Model | Standard llama.cpp (tokens/s) | ToshLLM (tokens/s) |
|---------|---------------|---------|
| Qwen3-8B | 0.6–2.6 | ~57 |
| Qwen3.6-35B (MoE) | Unusable | ~26 (with MTP) |

This is not a minor optimization but a jump of more than an order of magnitude: from nearly unusable to smooth operation.
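To make the size of the gap concrete, the short calculation below divides the ToshLLM figure by both ends of the baseline range. The numbers are copied from the Qwen3-8B row of the table; the variable and function names are illustrative only.

```python
# Throughput figures in tokens per second, copied from the comparison table.
BASELINE_RANGE = (0.6, 2.6)  # standard llama.cpp, Qwen3-8B
TOSHLLM = 57.0               # ToshLLM, Qwen3-8B (approximate figure)


def speedup(new: float, old: float) -> float:
    """Return how many times faster `new` is than `old`."""
    return new / old


low = speedup(TOSHLLM, BASELINE_RANGE[1])   # against the fastest baseline run
high = speedup(TOSHLLM, BASELINE_RANGE[0])  # against the slowest baseline run
print(f"speedup: {low:.0f}x to {high:.0f}x")  # prints "speedup: 22x to 95x"
```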

---

## 1. AMD Dedicated Patches

The core of ToshLLM is its set of AMD-specific patches for llama.cpp. These patches work around two key issues with the Metal driver on AMD GPUs:

- **Chunked Transfer**: The patches in the `patches/` directory transfer data in chunks to bypass Metal driver limitations on host-visible memory allocation
- **Concurrency Control**: ToshLLM automatically sets the `GGML_METAL_CONCURRENCY_DISABLE` environment variable to ensure stable operation on AMD hardware
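The two ideas above can be sketched in a few lines. This is a conceptual illustration in Python, not the actual patch code (which lives in the C/Objective-C sources of llama.cpp). The chunk size, the function names, and the value `"1"` for the environment variable are all assumptions made for the example.

```python
import os

CHUNK_BYTES = 64 * 1024 * 1024  # hypothetical chunk size, not taken from ToshLLM


def chunk_ranges(total_bytes: int, chunk_bytes: int = CHUNK_BYTES):
    """Yield (offset, length) pairs that cover total_bytes in order."""
    offset = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        yield offset, length
        offset += length


def chunked_copy(src: bytes, dst: bytearray, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Copy src into a same-sized dst one chunk at a time; return the chunk count."""
    chunks = 0
    for offset, length in chunk_ranges(len(src), chunk_bytes):
        dst[offset:offset + length] = src[offset:offset + length]
        chunks += 1
    return chunks


def amd_environment(base=None) -> dict:
    """Return a copy of the environment with Metal concurrency disabled."""
    env = dict(os.environ if base is None else base)
    env["GGML_METAL_CONCURRENCY_DISABLE"] = "1"  # the value "1" is an assumption
    return env


weights = bytes(range(256)) * 4      # 1024 bytes standing in for a tensor
staging = bytearray(len(weights))    # stand-in for a small staging buffer
n_chunks = chunked_copy(weights, staging, chunk_bytes=300)
```

The point of the chunked loop is that no single allocation or copy has to be as large as the whole tensor, which is the kind of limit the patches are described as working around.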

## 2. Intelligent Optimization for MoE Models

Mixture-of-Experts (MoE) models like Qwen3.6-35B-A3B have gained attention for their efficient parameter utilization, but they often struggle to run on consumer hardware. ToshLLM provides:

- **Automatic `--n-cpu-moe` Calculation**: Computes, from the hardware configuration, how much of the expert computation to keep on the CPU
- **Hybrid Inference Mode**: Enables smooth operation of 35B-class MoE models even with 12 GB of VRAM
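One plausible way to pick a value for `--n-cpu-moe` is a simple memory budget: place as many layers' expert weights in VRAM as will fit, and keep the rest on the CPU. The sketch below follows that idea. It is not ToshLLM's actual algorithm, and every number in the example (layer count, per-layer size, VRAM budget) is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def n_cpu_moe(n_layers: int, expert_bytes_per_layer: int,
              other_bytes: int, vram_budget_bytes: int) -> int:
    """Smallest number of layers whose expert weights must stay on the CPU
    so that everything else fits inside the VRAM budget."""
    # VRAM left for expert weights after non-expert tensors, KV cache, etc.
    free_for_experts = vram_budget_bytes - other_bytes
    if free_for_experts <= 0:
        return n_layers  # nothing fits: keep every layer's experts on the CPU
    layers_on_gpu = min(n_layers, free_for_experts // expert_bytes_per_layer)
    return n_layers - layers_on_gpu


# Hypothetical model: 40 layers, 450 MiB of expert weights per layer,
# 3 GiB of other GPU-resident data, and a 10 GiB usable VRAM budget.
value = n_cpu_moe(n_layers=40, expert_bytes_per_layer=450 * MIB,
                  other_bytes=3 * GIB, vram_budget_bytes=10 * GIB)
print(f"--n-cpu-moe {value}")  # prints "--n-cpu-moe 25"
```

With these made-up figures, 15 layers of experts fit in the remaining 7 GiB, so the other 25 stay on the CPU.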

## 3. MTP Speculative Decoding

ToshLLM supports speculative decoding based on Multi-Token Prediction (MTP), which raises generation speed by approximately 34% without degrading output quality. In real-time chat, this makes for a noticeably smoother experience.
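The reason speculative decoding does not change the output can be shown with a toy model. In the sketch below, a cheap `draft` function proposes several tokens and the `target` function keeps only the prefix it agrees with, so the result matches plain greedy decoding token for token. Both functions are made-up stand-ins; with MTP the drafted tokens would typically come from extra prediction heads of the same model, and a real engine verifies all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Reference loop: one target-model step per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Draft up to k tokens, keep the prefix the target agrees with.
    Returns the tokens and the number of verification passes."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        passes += 1  # counted as one pass; here simulated position by position
        for candidate in proposal:
            expected = target(tokens)
            tokens.append(expected)  # only the target's token is ever emitted
            if candidate != expected:
                break                # discard the remaining drafted tokens
    return tokens, passes


def target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 7


def draft(tokens: List[int]) -> int:
    guess = (tokens[-1] * 3 + 1) % 7
    return (guess + 1) % 7 if tokens[-1] == 2 else guess  # wrong after token 2


out, passes = speculative_decode(target, draft, [1], n_new=12)
```

Here `out` equals the greedy result while `passes` is smaller than the 12 steps greedy decoding needs; any real speedup depends on how often the drafts are accepted.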
