
ToshLLM: Enabling Local LLM Execution on Intel Macs with AMD GPUs

ToshLLM is a tool for running large language models (LLMs) locally on Intel Macs with AMD discrete GPUs. It uses Metal acceleration and dedicated AMD patches to fix the corrupted output and poor performance that conventional tools show on this hardware.

Tags: ToshLLM, local large language models, Intel Mac, AMD GPU, Metal acceleration, llama.cpp, MoE models, speculative decoding, local AI, open-source tools

Published 2026-06-12 23:12 | Recent activity 2026-06-12 23:20 | Estimated read 6 min

Section 01

Introduction: ToshLLM: Enabling Local LLM Execution on Intel Macs with AMD GPUs

Section 02

Original Author and Source

Original Author/Maintainer: Engelbert Delgado (@engeldlgado)
Source Platform: GitHub
Original Title: toshllm: Run large language models locally on Intel Macs with AMD GPUs
Original Link: https://github.com/engeldlgado/toshllm
Publication Date: June 12, 2026

Section 03

Background: The Overlooked Hardware Group

Local deployment tools for large language models (LLMs) have multiplied in recent years. On closer inspection, though, the vast majority of them target Apple Silicon. The M1, M2, and M3 series, with their unified memory architecture and powerful Neural Engine, do make ideal platforms for running LLMs locally.

But one group of users has been overlooked, intentionally or not: those on Intel Macs with AMD discrete GPUs, including Hackintosh users. This hardware faces two critical issues with conventional local LLM tools:

Corrupted Output: Standard inference engines like llama.cpp produce garbled or corrupted output on AMD discrete GPUs
Poor Performance: The speed of model weight transfer via PCIe is far below the hardware's actual capability, causing severe bandwidth bottlenecks

ToshLLM was created to address exactly these two problems.

Section 04

Project Overview: Built Exclusively for Intel Mac + AMD GPU

ToshLLM is a native macOS SwiftUI application built on llama.cpp with dedicated patches for AMD GPUs. Engelbert Delgado developed and tuned it on an Intel Mac with an RX 6700 XT 12GB GPU.

Section 05

Core Performance Comparison

Model	Standard llama.cpp (generation speed, tokens/s)	ToshLLM (generation speed, tokens/s)
Qwen3-8B	0.6–2.6	~57
Qwen3.6-35B (MoE)	Unusable	~26 (with MTP)

This is not a minor optimization. For Qwen3-8B, the figures above work out to a speedup of roughly 22x to 95x, more than an order of magnitude, which takes the model from nearly unusable to smooth operation.

Section 06

1. AMD Dedicated Patches

The core of ToshLLM is its set of AMD-specific patches for llama.cpp. They work around two problems with the Metal driver on AMD GPUs:

Chunked Transfer: Patches in the patches/ directory split transfers into chunks to work around Metal driver limits on host-visible memory allocation
Concurrency Control: ToshLLM automatically sets the GGML_METAL_CONCURRENCY_DISABLE environment variable to keep operation stable on AMD hardware
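
The two ideas above can be illustrated with a minimal Python sketch. It is not ToshLLM's code, whose actual patches live in the patches/ directory and operate inside llama.cpp. The helper names chunk_ranges and amd_safe_env are hypothetical, and the value "1" for the environment variable is an assumption, since the article only says that the variable is set.

```python
import os
from typing import Dict, Iterator, Mapping, Optional, Tuple


def chunk_ranges(total_bytes: int, chunk_bytes: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) pairs covering total_bytes in pieces of at most chunk_bytes.

    A chunked upload would copy each piece through a small staging buffer
    instead of allocating one host-visible buffer for the whole tensor.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    offset = 0
    while offset < total_bytes:
        size = min(chunk_bytes, total_bytes - offset)
        yield offset, size
        offset += size


def amd_safe_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with Metal concurrency disabled.

    The value "1" is an assumption made for this sketch.
    """
    env = dict(os.environ if base is None else base)
    env["GGML_METAL_CONCURRENCY_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    # A 10-byte payload moved in 4-byte chunks needs three copies.
    for offset, size in chunk_ranges(10, 4):
        print(f"copy {size} bytes at offset {offset}")
```

The dictionary returned by amd_safe_env could be passed as the env argument of subprocess.run when launching a llama.cpp binary, so the setting applies only to that child process.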

Section 07

2. Intelligent Optimization for MoE Models

Mixture-of-Experts (MoE) models like Qwen3.6-35B-A3B have gained attention for their efficient parameter utilization, but they often struggle to run on consumer hardware. ToshLLM provides:

Automatic --n-cpu-moe Calculation: Automatically computes the optimal allocation of expert computations on the CPU based on hardware configuration
Hybrid Inference Mode: Enables smooth operation of 35B-level MoE models even with 12GB of VRAM
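
The article does not describe how ToshLLM derives the value it passes to --n-cpu-moe. The sketch below shows one plausible heuristic, taking the flag to mean the number of layers whose expert weights stay on the CPU. The function name, its parameters, and all sizes in the example are illustrative assumptions, not measurements of any real model.

```python
def estimate_n_cpu_moe(
    n_layers: int,
    expert_bytes_per_layer: int,
    dense_bytes_total: int,
    vram_bytes: int,
    reserve_bytes: int,
) -> int:
    """Estimate how many layers must keep their expert weights on the CPU.

    Hypothetical heuristic: reserve VRAM for the non-expert weights and for
    a safety margin (KV cache, scratch buffers), then fit as many layers'
    experts as possible into what remains.
    """
    if expert_bytes_per_layer <= 0:
        raise ValueError("expert_bytes_per_layer must be positive")
    budget = vram_bytes - reserve_bytes - dense_bytes_total
    if budget <= 0:
        return n_layers  # nothing fits, so every layer's experts stay on the CPU
    layers_on_gpu = min(n_layers, budget // expert_bytes_per_layer)
    return n_layers - layers_on_gpu


if __name__ == "__main__":
    MiB = 1024 ** 2
    GiB = 1024 ** 3
    # Made-up sizes for a 40-layer MoE model on a 12 GiB card.
    n = estimate_n_cpu_moe(
        n_layers=40,
        expert_bytes_per_layer=400 * MiB,
        dense_bytes_total=2 * GiB,
        vram_bytes=12 * GiB,
        reserve_bytes=2 * GiB,
    )
    print(f"--n-cpu-moe {n}")
```

A real implementation would read layer counts and tensor sizes from the model file and query the GPU for free memory rather than take them as arguments.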

Section 08

3. MTP Speculative Decoding

ToshLLM supports Multi-Token Prediction (MTP) speculative decoding, which can raise generation speed by approximately 34% without reducing output quality. For real-time chat, that means a noticeably smoother experience.
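
To see why speculative decoding can leave quality unchanged, here is a simplified greedy sketch. It is not ToshLLM's or llama.cpp's implementation. The draft callable stands in for the MTP predictions, target stands in for the full model, and both are hypothetical toy functions. Because every emitted token is the one target would have chosen, the output matches plain greedy decoding, and the speedup in a real engine comes from verifying all drafted tokens in one batched pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_generate(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Reference decoder: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_generate(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> List[Token]:
    """Greedy speculative decoding: draft up to k tokens, keep the verified prefix."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The cheap draft proposes up to k tokens ahead.
        ctx = list(out)
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each drafted position. A real engine does this
        #    in a single batched forward pass; here it is called per position.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always emit the target's own choice
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return out


if __name__ == "__main__":
    def target(ctx: List[Token]) -> Token:
        return (ctx[-1] * 3 + 1) % 7

    def draft(ctx: List[Token]) -> Token:
        # Agrees with the target except at every third position.
        return target(ctx) if len(ctx) % 3 else 0

    same = speculative_generate(target, draft, [1], 10) == greedy_generate(target, [1], 10)
    print("matches greedy decoding:", same)
```

The better the draft agrees with the target, the more tokens each verification pass accepts, which is where a gain such as the reported 34% would come from.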
