Zing Forum

In-depth Testing of Local Large Model Inference on Apple M4: Performance Analysis of MLX + DDTree Speculative Decoding vs. Ollama

Comprehensive evaluation of local large language model inference performance on Apple M4 chip, in-depth comparison of performance differences between MLX framework and Ollama, and analysis of the actual acceleration effect of DDTree speculative decoding technology

Tags: MLX, Apple Silicon, Local Inference, Speculative Decoding, Ollama, Qwen, MoE, Large Language Models, Edge AI, Performance Evaluation
Published 2026-04-26 14:15 · Recent activity 2026-04-26 14:20 · Estimated read 3 min

Section 01

In-depth Testing of Local Large Model Inference on Apple M4: Performance Analysis of MLX + DDTree Speculative Decoding vs. Ollama

This evaluation focuses on local large language model inference performance on the Apple M4 chip, comparing the MLX framework against Ollama and analyzing the acceleration delivered by DDTree speculative decoding. Key findings: the MLX framework significantly outperforms Ollama, the MoE architecture shows strong performance advantages on Apple Silicon, and DDTree further improves inference speed.

Section 02

Background: The Rise of Edge AI Inference

As large language model technology matures, running models efficiently on local devices has drawn increasing attention. Apple Silicon has become an attractive platform for edge AI inference thanks to its unified memory architecture and Neural Engine, but choosing the right framework and optimization strategy is crucial for performance.

Section 03

Testing Environment and Methods

Tests were run on a MacBook Air M4 (10 cores: 4 performance + 6 efficiency, 32 GB unified memory) running macOS 15.7 Sequoia. The task: generate a Python implementation of a red-black tree, capped at 200 tokens. Measurement method: 2 warm-up runs followed by 5 formal timed runs, taking the median. The metric is pure generation speed (tok/s), excluding prefill time.
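The methodology above can be sketched as a small harness. This is a minimal illustration, not the authors' actual tooling: `generate` is a hypothetical stand-in for the model call, assumed to report how many tokens it produced and how long prefill took, so prefill can be subtracted out.

```python
import statistics
import time

def benchmark(generate, prompt, max_tokens=200, warmup=2, runs=5):
    """Median pure-generation speed in tok/s, excluding prefill.

    `generate(prompt, max_tokens)` is a stand-in for the model call;
    it is assumed to return (n_generated_tokens, prefill_seconds).
    """
    for _ in range(warmup):                  # warm-up runs, not timed
        generate(prompt, max_tokens)

    speeds = []
    for _ in range(runs):                    # formal timed runs
        t0 = time.perf_counter()
        n_tokens, prefill_s = generate(prompt, max_tokens)
        total_s = time.perf_counter() - t0
        # Pure generation speed: exclude prefill from the elapsed time.
        speeds.append(n_tokens / (total_s - prefill_s))

    return statistics.median(speeds)         # median over timed runs
```

The median (rather than the mean) makes the result robust to a single run being slowed by background activity.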

Section 04

Key Findings: Significant Advantages of MLX

Qwen3.6-35B-MoE Model Comparison

  • DDTree (MLX): 28.7 tok/s, 2.33x faster than Ollama
  • Plain MLX: 26.9 tok/s, 2.19x faster than Ollama
  • Ollama (GGUF-Q4_K_P): 12.3 tok/s (baseline)
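To make the DDTree numbers concrete, here is a toy sketch of the draft-and-verify loop that speculative decoding is built on. The source does not describe DDTree's internals; this shows only the generic single-branch case (DDTree presumably drafts a tree of candidates instead of one chain), and `target_next` / `draft_next` are hypothetical greedy stand-ins for the target and draft models.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One draft-and-verify step of plain speculative decoding.

    `target_next` / `draft_next` map a token sequence to the next
    token (greedy stand-ins for real models). Returns the tokens
    accepted in this step; every returned token matches what the
    target model alone would have produced.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model verifies the proposals in order: keep the
    #    longest agreeing prefix, then emit the target's own token.
    accepted, ctx = [], list(context)
    for t in draft:
        target_t = target_next(ctx)
        if target_t != t:
            accepted.append(target_t)   # mismatch: target's correction
            return accepted
        accepted.append(t)              # match: draft token accepted
        ctx.append(t)

    accepted.append(target_next(ctx))   # bonus token after full acceptance
    return accepted
```

The speedup comes from verifying k draft tokens with (in a real implementation) one batched target-model pass; output is identical to running the target model alone, which is why the comparison above is purely about tok/s.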