Zing Forum

ToshLLM: Enabling Local LLM Execution on Intel Macs with AMD GPUs

ToshLLM is a tool for running large language models (LLMs) locally on Intel Macs with AMD discrete GPUs. It uses Metal acceleration and AMD-specific patches to fix the corrupted output and poor performance that conventional tools exhibit on this hardware.

Tags: ToshLLM, local large language models, Intel Mac, AMD GPU, Metal acceleration, llama.cpp, MoE models, speculative decoding, local AI, open-source tools
Published 2026-06-12 23:12 | Recent activity 2026-06-12 23:20 | Estimated read 6 min

Section 02

Original Author and Source

  • Original Author/Maintainer: Engelbert Delgado (@engeldlgado)
  • Source Platform: GitHub
  • Original Title: toshllm: Run large language models locally on Intel Macs with AMD GPUs
  • Original Link: https://github.com/engeldlgado/toshllm
  • Publication Date: June 12, 2026

Section 03

Background: The Overlooked Hardware Group

In recent years, tools for running large language models (LLMs) locally have proliferated. However, the vast majority of them target Apple Silicon: the M1, M2, and M3 series, with their unified memory architecture and Neural Engine, are well suited to running local LLMs.

But one group of users has been overlooked, intentionally or not: those on Intel Macs equipped with AMD discrete GPUs, including Hackintosh users. This hardware faces two critical issues with conventional local LLM tools:

  1. Corrupted Output: Standard inference engines like llama.cpp produce garbled or corrupted output on AMD discrete GPUs
  2. Poor Performance: Model weights are transferred over PCIe far more slowly than the hardware allows, which creates a severe bandwidth bottleneck

ToshLLM was created to address both problems.


Section 04

Project Overview: Built Exclusively for Intel Mac + AMD GPU

ToshLLM is a native macOS SwiftUI application built on llama.cpp with AMD-specific patches applied. Its developer, Engelbert Delgado, built and tuned it on an Intel Mac equipped with an RX 6700 XT 12GB GPU.

Section 05

Core Performance Comparison

Model Configuration | Standard llama.cpp | ToshLLM
Qwen3-8B generation speed | 0.6–2.6 t/s | ~57 t/s
Qwen3.6-35B (MoE) | Unusable | ~26 t/s (with MTP)

(t/s = tokens per second)

This is not a minor optimization but an improvement of more than an order of magnitude: Qwen3-8B goes from barely usable to fluent, and the 35B MoE model goes from unusable to usable.
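To put the Qwen3-8B row in concrete terms, the short calculation below derives the speedup range from the figures quoted in the comparison. It only restates the published numbers; the variable and function names are illustrative.

```python
# Throughput figures (tokens per second) copied from the comparison above.
baseline_low, baseline_high = 0.6, 2.6   # standard llama.cpp, Qwen3-8B
toshllm_tps = 57.0                       # ToshLLM, Qwen3-8B (approximate)


def speedup(new_tps: float, old_tps: float) -> float:
    """Return how many times faster new_tps is than old_tps."""
    return new_tps / old_tps


best_case = speedup(toshllm_tps, baseline_low)    # versus the slowest baseline
worst_case = speedup(toshllm_tps, baseline_high)  # versus the fastest baseline

print(f"Speedup range: {worst_case:.0f}x to {best_case:.0f}x")
```

Even against the best baseline figure, the gain is well above a factor of ten.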


Section 06

1. AMD Dedicated Patches

The core of ToshLLM is its set of AMD-specific patches for llama.cpp. These patches work around two problems with the Metal driver on AMD GPUs:

  • Chunked Transfer: Patches in the patches/ directory transfer data in chunks to bypass Metal driver limitations on host-visible memory allocation
  • Concurrency Control: ToshLLM automatically sets the GGML_METAL_CONCURRENCY_DISABLE environment variable to ensure stable operation on AMD hardware
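The two workarounds above can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not ToshLLM's actual code: the real patches modify llama.cpp itself, and the helpers `chunk_ranges`, `copy_in_chunks`, and `launch_env`, the chunk size, and the value "1" for the environment variable are all hypothetical choices made for illustration. Only the variable name GGML_METAL_CONCURRENCY_DISABLE comes from the text.

```python
from typing import Dict, Iterator, Mapping, Tuple


def chunk_ranges(total_bytes: int, chunk_bytes: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover total_bytes in fixed-size chunks."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    offset = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        yield offset, length
        offset += length


def copy_in_chunks(src: bytes, dst: bytearray, chunk_bytes: int) -> int:
    """Copy src into dst one chunk at a time and return the number of chunks.

    A real implementation would stage each chunk through a small
    host-visible buffer; plain slices stand in for that here.
    """
    chunks = 0
    for offset, length in chunk_ranges(len(src), chunk_bytes):
        dst[offset:offset + length] = src[offset:offset + length]
        chunks += 1
    return chunks


def launch_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base with Metal concurrency disabled (value assumed)."""
    env = dict(base)
    env["GGML_METAL_CONCURRENCY_DISABLE"] = "1"
    return env


weights = bytes(range(256)) * 40        # 10,240 bytes standing in for a tensor
staging = bytearray(len(weights))
n_chunks = copy_in_chunks(weights, staging, chunk_bytes=4096)
```

The point of chunking is that no single allocation has to be as large as the whole tensor, so a per-allocation limit in the driver is never hit. A launcher could pass `launch_env(os.environ)` as the environment of the inference process.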

Section 07

2. Intelligent Optimization for MoE Models

Mixture-of-Experts (MoE) models such as Qwen3.6-35B-A3B are attractive because only a small subset of their parameters is active for each token, but their full set of weights often does not fit in consumer GPU memory. ToshLLM provides:

  • Automatic --n-cpu-moe Calculation: Computes, from the hardware configuration, how many layers should keep their expert weights on the CPU (the value passed to llama.cpp's --n-cpu-moe option)
  • Hybrid Inference Mode: Splits work between CPU and GPU so that 35B-class MoE models run smoothly even with 12GB of VRAM
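ToshLLM's actual heuristic is not described here. One plausible approach, sketched below, is to keep expert weights on the CPU for just enough layers that everything else fits in VRAM. The function `estimate_n_cpu_moe` and every size in the example are made-up illustration values, not measurements of any particular model.

```python
from typing import List


def estimate_n_cpu_moe(
    n_layers: int,
    expert_mib_per_layer: int,
    other_mib: int,
    vram_mib: int,
    reserve_mib: int = 1024,
) -> int:
    """Estimate how many layers should keep their expert weights on the CPU.

    other_mib   : VRAM needed by everything except expert weights
                  (attention weights, KV cache, buffers).
    reserve_mib : headroom left free for the OS and display.
    """
    budget = vram_mib - reserve_mib - other_mib
    if budget <= 0:
        return n_layers                     # nothing fits: all experts on CPU
    layers_on_gpu = min(n_layers, budget // expert_mib_per_layer)
    return n_layers - layers_on_gpu


def moe_args(n_cpu_moe: int) -> List[str]:
    """Build the llama.cpp command-line fragment for the computed value."""
    return ["--n-cpu-moe", str(n_cpu_moe)]


# Hypothetical model: 40 layers, 512 MiB of expert weights per layer,
# 3 GiB of non-expert data, running on a 12 GiB card.
n = estimate_n_cpu_moe(
    n_layers=40, expert_mib_per_layer=512, other_mib=3072, vram_mib=12288
)
print(moe_args(n))
```

With these example numbers, 8 GiB remains for experts, which covers 16 layers, so the other 24 layers keep their experts on the CPU. A real calculation would read tensor sizes from the model file rather than assume a uniform per-layer size.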

Section 08

3. MTP Speculative Decoding

ToshLLM supports Multi-Token Prediction (MTP) speculative decoding, which increases generation speed by approximately 34% without reducing output quality. In interactive chat, this makes responses noticeably smoother.
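The reason speculative decoding can be faster without changing the output is easiest to see in a toy greedy example. The sketch below is a generic draft-and-verify loop, not ToshLLM's or llama.cpp's MTP implementation: in MTP the draft tokens come from the model's own extra prediction heads, whereas here a separate hypothetical `draft_next` function stands in, and both "models" are small deterministic functions invented for the demonstration.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy 'full model': the next token is a fixed function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy 'draft': agrees with the target except at every 4th position."""
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 4 == 0 else tok


def greedy_decode(prompt: List[int], n_new: int, next_token: NextToken) -> List[int]:
    """Ordinary decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(
    prompt: List[int], n_new: int, target: NextToken, draft: NextToken, k: int = 3
) -> Tuple[List[int], int]:
    """Draft up to k tokens, then keep the prefix the target agrees with.

    Returns the tokens and the number of verification passes. In a real
    engine each pass is one batched forward pass of the full model; here
    the target is called per token for clarity.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        passes += 1
        for tok in proposal:
            expected = target(out)
            out.append(expected)        # always emit the target's own token
            if tok != expected:
                break                   # discard the rest of the proposal
    return out, passes


prompt = [1, 2, 3]
plain = greedy_decode(prompt, 12, target_next)
fast, passes = speculative_decode(prompt, 12, target_next, draft_next)
print(fast == plain, passes)
```

Because every emitted token is the one the full model would have chosen, the result matches plain greedy decoding exactly, while the number of verification passes is lower than the number of tokens generated. The actual speedup depends on how often the draft is right and on the cost of drafting.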