# LLM Inference Revolution on Apple Silicon: How m5-infer Achieves 4.5x Performance Boost

> m5-infer is an MLX inference engine optimized specifically for Apple Silicon. It achieves a decoding speed of 40 tokens per second on the M5 MacBook Air, which is a 4.5x improvement over Ollama. Through innovative technologies like cross-turn state persistence and hybrid speculative decoding, it significantly reduces latency while maintaining output quality.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-20T04:13:21.000Z
- Last activity: 2026-04-20T04:50:35.675Z
- Popularity: 163.4
- Keywords: Apple Silicon, MLX, local LLM, inference optimization, Qwen, Ollama, speculative decoding, M5 Mac, edge AI, model quantization
- Page link: https://www.zingnex.cn/en/forum/thread/apple-siliconllm-m5-infer4-5
- Canonical: https://www.zingnex.cn/forum/thread/apple-siliconllm-m5-infer4-5
- Markdown source: floors_fallback

---

## Introduction / Main Floor: LLM Inference Revolution on Apple Silicon: How m5-infer Achieves 4.5x Performance Boost

m5-infer is an MLX inference engine optimized specifically for Apple Silicon. It achieves a decoding speed of 40 tokens per second on the M5 MacBook Air, which is a 4.5x improvement over Ollama. Through innovative technologies like cross-turn state persistence and hybrid speculative decoding, it significantly reduces latency while maintaining output quality.

## Performance Data Overview

In tests with the Qwen 3.5 9B 4-bit quantized model, m5-infer shows overwhelming advantages:

| Metric | Ollama | mlx_lm.server | m5-infer v1.0.0 |
|--------|--------|---------------|----------------|
| Decoding Speed (tok/s) | 8.9 | 17.0 | **40.0** |
| Relative to Ollama | 1.0x | 1.9x | **4.5x** |
| Relative to mlx_lm.server | 0.5x | 1.0x | **2.4x** |

More impressive is the balance between latency and quality:
- **12K Tool Mode Warm-up TTFT**: Reduced from 64.9s to 11.1s (only 2-3s for the second call)
- **5th Round Latency in 5-Round Conversation**: Ollama failed completely, while m5-infer only took 7.5s
- **Opus-4.7 Quality Score**: 5.85/10, surpassing Ollama's 5.28/10 (+11%)

All tests were conducted on the same Mac, using the same model and prompts. The performance gap comes entirely from optimizations at the inference engine layer.
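
To sanity-check numbers like these on your own machine, decoding speed reduces to tokens generated divided by wall-clock time. A minimal harness (the `generate` callable and its signature are stand-ins, not m5-infer's API):

```python
import time

def measure_decode_speed(generate, prompt, n_runs=3):
    """Average decoding tokens/sec over n_runs calls.

    `generate` is any callable that runs a completion and returns the
    number of tokens it produced -- a placeholder, not m5-infer's API.
    """
    speeds = []
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        speeds.append(n_tokens / (time.perf_counter() - start))
    return sum(speeds) / len(speeds)
```

For a fair comparison, as in the table above, keep the model, quantization, and prompts fixed and vary only the engine.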

## Core Technical Architecture

m5-infer is built on Apple's MLX framework and positioned as an OpenAI-compatible HTTP inference server that can directly replace mlx_lm.server. Its core architecture is optimized around the Qwen 3.5 hybrid model (GatedDeltaNet + Full Attention), while supporting multiple model families like Qwen 2.5/3.6, Llama 3.x, Mistral, and Gemma 2/3/4 via a model family abstraction layer.

## Eight Core Optimization Technologies

#### 1. Hybrid Speculative Decoding

Qwen 3.5 uses a hybrid architecture of 24 GatedDeltaNet (GDN) layers + 8 full attention layers. Traditional speculative decoding faces a critical issue at the GDN layer: when a draft token is rejected, the KV cache can be rolled back, but the GDN's recurrent state and convolution buffer have already advanced through the entire draft window, leading to state corruption. m5-infer's solution is to snapshot all GDN layers' (recurrent_state, conv_buf) into a pre-allocated tensor pool before each validation. When rejected, it recovers from the snapshot in O(1) time with zero allocation on the hot path. In practice, this brings a 35% throughput improvement (from 29 to 40 tok/s) on Qwen 3.5 9B, with an acceptance rate of about 70%.
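
The snapshot-and-restore mechanism can be sketched as follows. NumPy stands in for MLX arrays, and the class, shapes, and buffer names are illustrative rather than m5-infer's actual internals; only the `(recurrent_state, conv_buf)` pairing comes from the description above:

```python
import numpy as np

class GDNStatePool:
    """Pre-allocated snapshot pool for per-layer GDN state.

    Buffers are allocated once at startup, so snapshot/restore performs
    only copies -- zero allocation on the hot path.
    """

    def __init__(self, n_layers, rec_shape, conv_shape):
        self.rec_snap = [np.zeros(rec_shape, dtype=np.float32) for _ in range(n_layers)]
        self.conv_snap = [np.zeros(conv_shape, dtype=np.float32) for _ in range(n_layers)]

    def snapshot(self, layers):
        """Save (recurrent_state, conv_buf) for every GDN layer before
        validating a draft window."""
        for i, (rec, conv) in enumerate(layers):
            np.copyto(self.rec_snap[i], rec)
            np.copyto(self.conv_snap[i], conv)

    def restore(self, layers):
        """Roll back after a rejected draft. Cost does not depend on how
        far the state advanced through the draft window."""
        for i, (rec, conv) in enumerate(layers):
            np.copyto(rec, self.rec_snap[i])
            np.copyto(conv, self.conv_snap[i])
```

Unlike a KV cache, which can simply be truncated, the recurrent state has no per-token history, which is why a full copy before each validation is the natural rollback primitive.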

#### 2. Cross-Turn State Persistence (CTRSP)

After each generation round, m5-infer serializes the complete model state (quantized KV cache + GDN recurrent/convolution buffer) to disk, using the hash of the original bytes of the prompt prefix tokens as the key. Since the hash is based on token bytes rather than decoded text, the same system prompt and tool mode can hit the cache even with different user inputs attached. Effect: The warm-up TTFT for the 12K token tool mode is reduced from 11s to 2-3s, and the cache hit rate for typical agent workloads exceeds 90%.
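
Keying on raw token bytes rather than decoded text can be sketched like this; the exact byte packing m5-infer hashes is not public, so 4-byte little-endian ids are an assumption:

```python
import hashlib

def ctrsp_cache_key(prefix_token_ids):
    """Hash the raw bytes of the prompt-prefix token ids.

    Two requests sharing the same system prompt + tool definitions
    produce the same prefix ids, hence the same key, regardless of
    the user turn appended afterwards.
    """
    raw = b"".join(int(t).to_bytes(4, "little") for t in prefix_token_ids)
    return hashlib.sha256(raw).hexdigest()
```

The serialized state file would then be looked up by this key on the next request with the same prefix.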

#### 3. Thought-Aware Budgeting and Escape Prompts

Qwen 3.5's chain-of-thought is wrapped in `<think>...</think>` tags. Common failure modes include:
- **Budget Starvation**: Most engines count thought tokens towards the user's max_tokens, leading to truncation in the answer phase
- **Thought Loop Trap**: The model gets stuck in an infinite loop like "Wait, let me re-check..."

m5-infer's solutions:
- Separate thought budget (max_thinking_tokens, default 32K), where the user's max_tokens is only used for the answer phase
- Run a 6-gram repetition detector inside the thought block (threshold of 3 repetitions)
- When a loop is detected, inject a typed transition prompt (e.g., "Final JSON:") to force the model into the desired output format

Effect: Structured JSON extraction task score increased from 1.40 to 7.85 (+461%), and code generation from 3.10 to 6.55 (+111%).
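
The repetition detector described above can be sketched as a sliding n-gram counter over the tokens emitted inside the think block (a minimal sketch; m5-infer's actual detector internals are not public):

```python
from collections import Counter

def detect_thought_loop(token_ids, n=6, threshold=3):
    """Return True if any n-gram occurs `threshold` or more times.

    On True, the engine would inject a typed transition prompt such as
    "Final JSON:" to break the loop.
    """
    if len(token_ids) < n:
        return False
    grams = Counter(tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1))
    return max(grams.values()) >= threshold
```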

#### 4. Needle-Retrieval Heuristic

Qwen 3.5 has a safety alignment issue when thought mode is disabled: in long contexts (12K+) with short retrieval queries, it sometimes refuses to answer, claiming "cannot disclose authoritative information"—even if the information comes from the user's own provided content. m5-infer automatically detects long context + short query mode at the routing layer and forces thought mode to be enabled, thus bypassing this limitation. In practice, the long context retrieval success rate increased from 0/6 to 6/6.
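
The routing check reduces to two length comparisons. A sketch, noting that the post only states "long contexts (12K+) with short retrieval queries", so the 64-token query threshold here is an assumption:

```python
def should_force_thinking(context_tokens: int, query_tokens: int,
                          long_ctx: int = 12_000, short_query: int = 64) -> bool:
    """Force thought mode for long-context + short-query requests,
    sidestepping the refusal behavior described above."""
    return context_tokens >= long_ctx and query_tokens <= short_query
```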

#### 5. Adaptive Layer Skipping (ALS)

For "simple" tokens, skip layers with minimal impact to reduce computation.

#### 6. Self-Speculative Early Exit (SSEE)

An internal speculative decoding mechanism of the model that terminates generation early when confidence is high.
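
The confidence gate at the heart of such a scheme can be sketched as follows; which layer's logits are checked and the 0.9 threshold are assumptions, since the post gives no details:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_token(intermediate_logits, threshold=0.9):
    """Emit the intermediate head's argmax as a self-drafted token when
    it is confident enough; return None to fall through to the full
    forward pass, which then verifies or replaces the draft."""
    probs = softmax(intermediate_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None
```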

#### 7. Parallel Expert Scheduling (PES)

Concurrently execute multiple expert paths in MoE (Mixture of Experts) models.

#### 8. X5-R Compiled Forward Propagation

Metal kernel fusion via mx.compile brings about a 40% throughput improvement (from 17 to 24 tok/s).

## Technical Contribution Breakdown

The table below shows the contribution of each optimization to the final performance:

| Innovation | Decoding Speed | Quality | TTFT/Latency |
|--------|---------|------|----------|
| Hybrid Speculative Decoding | +35% | Output Equivalent | — |
| CTRSP | — | — | 12K Warm-up TTFT: 11s → 2-3s |
| Thought-Aware Budgeting | — | +36% Opus Score | — |
| Needle-Retrieval Heuristic | — | Long Context Retrieval: 0/6 → 6/6 | — |
| ALS + SSEE + PES | +10-15% | — | — |
| X5-R Compiled Forward | +40% | — | Cold Start +2-5s |
| **Full Stack Integration** | **4.5x** | **+11%** | **5.8x** |

## Practical Application Scenarios

m5-infer's design clearly targets production-grade deployment on Apple Silicon:

### Agent Workload Optimization

- Hot start latency of only 2-3s for the 12K mode in tool call scenarios
- Multi-turn conversation state persistence to avoid redundant computation
- MCP tool integration support

### Development Environment Integration

- OpenAI-compatible API, which can be directly integrated into existing toolchains
- Supports multiple models like Claude, Gemini, Grok
- Local SQLite persistent sessions
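
Because the server speaks the OpenAI chat schema, any stock client works. A dependency-free sketch (the port and model name are placeholders, not m5-infer defaults):

```python
import json
import urllib.request

M5_INFER_URL = "http://localhost:8080/v1/chat/completions"  # port is an assumption

def build_chat_request(model, messages, max_tokens=256, stream=False):
    """Build an OpenAI-compatible chat-completions payload."""
    return json.dumps({
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
    })

def chat(prompt, model="qwen3.5-9b-4bit"):
    """POST one user turn and return the assistant's reply text."""
    payload = build_chat_request(model, [{"role": "user", "content": prompt}])
    req = urllib.request.Request(
        M5_INFER_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Summarize the CTRSP optimization in one sentence."))
```

Pointing an existing OpenAI SDK client at the same base URL should work equally well, since that is the compatibility the post claims.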
