Zing Forum

Reading

Practical Guide to Local LLM Deployment: Multi-Platform Inference Environment Configuration Based on llama.cpp

A detailed local LLM deployment solution covering three platforms: Fedora Linux (AMD ROCm), macOS (Apple Silicon), and Docker headless server. It provides pre-configured settings, startup scripts, and model quantization recommendations to help developers run open-source large models efficiently on consumer-grade hardware.

llama.cpp本地部署LLM推理AMD ROCmApple Silicon模型量化开源大模型GemmaQwenClaude Code
Published 2026-05-26 06:09Recent activity 2026-05-26 06:18Estimated read 6 min
Practical Guide to Local LLM Deployment: Multi-Platform Inference Environment Configuration Based on llama.cpp
1

Section 01

Practical Local LLM Deployment: Introduction to Multi-Platform Configuration Guide Based on llama.cpp

This project is maintained by AYastrebov and provides a local LLM deployment solution based on llama.cpp, covering three platforms: Fedora Linux (AMD ROCm), macOS (Apple Silicon), and Docker headless server. It includes pre-configured settings, startup scripts, and model quantization recommendations to help developers run open-source large models (such as Gemma, Qwen, etc.) efficiently on consumer-grade hardware. The project source is the GitHub repository local-llm-setup, updated on 2026-05-25.

2

Section 02

Project Background and Positioning

With the development of open-source large language models, developers want to deploy locally to gain privacy protection, low latency, and flexible control, but face barriers such as hardware compatibility and driver configuration. This project addresses this issue by providing a complete configuration solution covering three mainstream scenarios: Fedora Linux (AMD Radeon) workstation, Apple Silicon Mac, and Docker headless server. It also integrates Claude Code collaborative skill definitions.

3

Section 03

Supported Hardware Platforms and Model Selection

Hardware Platforms: 1. Fedora Linux + AMD Radeon: Reference configuration: Intel i5-14600K + RX9060 XT (16GB VRAM) + 32GB RAM, using ROCm acceleration; 2. macOS + Apple Silicon: M2 Max + 64GB unified memory, Metal backend; 3. Docker headless server: Intel i3-6100T + 24GB RAM, CPU-only inference.

Model Recommendations: Gemma4 26B-A4B (general dialogue/vision), Qwen3.6 27B (inference/code), Qwen3.6 35B-A3B (MoE, low VRAM), LFM2.5-350M (lightweight, resource-constrained).

4

Section 04

Model Quantization Strategy and MTP Acceleration Technology

Quantization Strategy: Mac (64GB RAM): Gemma4 uses Q8_K_XL (28GB), Qwen3.6 27B uses Q6_K_XL (26GB); Fedora (16GB VRAM): Gemma4 uses Q3_K_XL (13GB), Qwen3.6 35B-A3B uses IQ3_XXS (14GB, need to uncomment KV_CACHE); Docker server: LFM2.5-350M uses Q8_0 (379MB).

MTP Acceleration: Speculative decoding technology that predicts multiple tokens at once, increasing speed by 1.4-2.2 times. It requires -MTP- version GGUF files and startup parameters --spec-type draft-mtp --spec-draft-n-max 6. Dense models benefit more than MoE models.

5

Section 05

Quick Deployment Steps for Fedora Linux Platform

  1. Install the ROCm suite (hipcc, rocminfo, etc.) and add yourself to the render and video user groups; 2. Clone the llama.cpp repository and compile it using build-llama.sh; 3. Copy the gemma-moe and qwen-mtp scripts to ~/.local/bin and grant execution permissions; 4. Use zshrc-snippet.sh to set environment variables and aliases; 5. Copy the models.json and opencode.jsonc configuration files to allow AI assistants to call local models.
6

Section 06

Claude Code Integration and Practical Value of the Project

Claude Code Integration: Define skills in the skills directory to enable Claude Code to call the local llama.cpp service, supporting cloud+local hybrid workflows, suitable for sensitive code or offline scenarios.

Target Audience: AI enthusiasts, enterprise developers with privacy requirements, teams aiming to reduce API costs, and technical learners.

Highlights: Ready-to-use, provides verified configurations and commands, reducing trial-and-error costs.

7

Section 07

Project Summary and Future Outlook

This project lowers the threshold for local LLM deployment and provides a complete cross-platform solution. With the iteration of llama.cpp and the improvement of open-source model capabilities, local deployment will become more mature and user-friendly. For developers who want to get rid of cloud dependency and embrace the open-source ecosystem, this is a practical hands-on guide.