Zing Forum

AX Engine: An LLM Inference Engine Built for Apple Silicon M4+

AX Engine is an LLM inference engine designed for Apple Silicon M4 and newer chips. It uses n-gram self-speculative decoding on top of MLX to achieve significant performance improvements.

Tags: LLM inference · Apple Silicon · MLX · speculative decoding · n-gram · performance optimization · Rust · local deployment
Published 2026/05/05 09:45 · Last activity 2026/05/05 10:32 · Estimated reading time: 8 minutes

Section 01

AX Engine: Overview of a Specialized LLM Inference Engine for Apple Silicon M4+

AX Engine is an LLM inference engine designed specifically for Apple Silicon M4 and newer chips. It leverages n-gram self-speculative decoding built on top of MLX to achieve significant performance improvements. Key focus areas include higher effective throughput for supported Transformer models, with innovations in decoding strategies, scheduling, and KV state management.

Section 02

Project Background & Motivation

With the evolution of Apple Silicon chips, the M4 series provides a strong hardware foundation for local LLM inference. However, existing frameworks such as mlx_lm still leave room for optimization in specific scenarios. AX Engine was developed to offer a more efficient inference solution for supported Transformer model families on Apple Silicon. Its core idea is to build a proprietary scheduling and speculative decoding layer on top of MLX that delivers higher effective throughput than the MLX reference runtime for supported models.

Section 03

Core Technical Architecture & Methods

Execution Layer Design:

  • Uses MLX's official mlx-c C API for tensor operations (no reimplementation of matrix multiplication or attention).
  • N-gram Speculative Decoding: Builds bigram/trigram tables at runtime and predicts up to 4 draft tokens per step. Drafts are validated via the target model's forward pass, with EMA acceptance-rate gating (τ=0.1, threshold=0.5) disabling speculation if quality drops. No second draft model or model modifications are needed.
  • Scheduler & KV Manager: Integrates request lifecycle, batching, memory block recovery, and execution planning in ax-engine-core (deterministic, no async, no framework dependency).
  • Chunked KV Cache: KV state grows via slice_update into preallocated buffers; speculation rollback is O(1) (only the sequence-length pointer moves). The lazy evaluation graph is flattened after each step to avoid O(N²) graph depth.
  • Graph Compilation: Enables mlx_enable_compile() at startup so Metal shaders are reused across same-shape steps.
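The n-gram speculation and EMA gating described above can be sketched as follows. This is an illustrative Rust model, not the engine's implementation: the struct and method names are hypothetical, and only the bigram path, the 4-token draft limit, and the τ=0.1 / 0.5 gate parameters come from the post.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of an n-gram self-speculative drafter.
/// Bigram table: last token id -> most recently observed successor.
struct NgramDrafter {
    bigrams: HashMap<u32, u32>,
    accept_ema: f64, // EMA of the per-step draft acceptance rate
    tau: f64,        // EMA smoothing factor (τ=0.1 in the post)
    threshold: f64,  // gate: disable speculation below this (0.5 in the post)
}

impl NgramDrafter {
    fn new() -> Self {
        Self { bigrams: HashMap::new(), accept_ema: 1.0, tau: 0.1, threshold: 0.5 }
    }

    /// Record a newly generated token pair into the bigram table.
    fn observe(&mut self, prev: u32, next: u32) {
        self.bigrams.insert(prev, next);
    }

    /// Propose up to `max_draft` tokens by walking the bigram table.
    fn draft(&self, last: u32, max_draft: usize) -> Vec<u32> {
        if self.accept_ema < self.threshold {
            return Vec::new(); // gated off: recent speculation quality dropped
        }
        let mut out = Vec::new();
        let mut cur = last;
        for _ in 0..max_draft {
            match self.bigrams.get(&cur) {
                Some(&next) => { out.push(next); cur = next; }
                None => break,
            }
        }
        out
    }

    /// After the target model's forward pass verified the drafts,
    /// fold this step's acceptance rate into the EMA.
    fn update(&mut self, accepted: usize, proposed: usize) {
        if proposed == 0 { return; }
        let rate = accepted as f64 / proposed as f64;
        self.accept_ema = (1.0 - self.tau) * self.accept_ema + self.tau * rate;
    }
}

fn main() {
    let mut drafter = NgramDrafter::new();
    // Seed the table from previously generated token ids.
    drafter.observe(10, 11);
    drafter.observe(11, 12);
    println!("draft after token 10: {:?}", drafter.draft(10, 4));
    // Repeated full rejections push the EMA below 0.5 and gate speculation off.
    for _ in 0..7 { drafter.update(0, 1); }
    println!("gated draft: {:?}", drafter.draft(10, 4));
}
```

The gate is cheap to evaluate and self-recovering: once accepted drafts appear again, `update` pulls the EMA back above the threshold and speculation resumes.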

Memory Layer Optimization: Preloads model weights into GPU memory via mlx_set_wired_limit to prevent paging, and uses dedicated GPU streams to avoid cross-stream synchronization on the default shared stream.
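The chunked KV cache with O(1) rollback can be illustrated with a minimal sketch. The names and flat-buffer layout here are hypothetical; only the preallocated-buffer and rollback-by-pointer behavior come from the post (where the in-place write is MLX's slice_update rather than a plain slice copy).

```rust
/// Hypothetical per-layer KV key buffer with O(1) speculative rollback.
struct ChunkedKvCache {
    keys: Vec<f32>,   // flat [capacity * head_dim] buffer, preallocated once
    head_dim: usize,
    capacity: usize,  // maximum sequence length
    len: usize,       // current sequence length: the only rollback state
}

impl ChunkedKvCache {
    fn new(capacity: usize, head_dim: usize) -> Self {
        Self { keys: vec![0.0; capacity * head_dim], head_dim, capacity, len: 0 }
    }

    /// Write one position's key vector in place (stands in for slice_update).
    fn append(&mut self, k: &[f32]) {
        assert!(self.len < self.capacity && k.len() == self.head_dim);
        let off = self.len * self.head_dim;
        self.keys[off..off + self.head_dim].copy_from_slice(k);
        self.len += 1;
    }

    /// Reject `n` draft positions: only the length pointer moves,
    /// so the cost is O(1) regardless of how much was written.
    fn rollback(&mut self, n: usize) {
        self.len -= n.min(self.len);
    }
}

fn main() {
    let mut cache = ChunkedKvCache::new(8, 2);
    cache.append(&[1.0, 1.0]); // committed token
    cache.append(&[2.0, 2.0]); // draft token 1
    cache.append(&[3.0, 3.0]); // draft token 2
    cache.rollback(2);         // both drafts rejected: pointer moves back
    cache.append(&[9.0, 9.0]); // next step simply overwrites the stale slot
    println!("len = {}, slot 1 = {:?}", cache.len, &cache.keys[2..4]);
}
```

Because rejected positions are never freed, only overwritten, rollback involves no allocation or copying, which is what keeps mis-speculation cheap.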

Section 04

Supported Model Families

AX Engine supports specific Transformer model families with handwritten forward-pass implementations in ax-engine-mlx, using the MLX safetensors weight format and model-manifest.json descriptors.

| Family | Models | Architecture Features |
| --- | --- | --- |
| Gemma4 | gemma-4-e2b-it, gemma-4-e4b-it | Layer-wise embedding, input gating, sliding window + full attention, KV sharing, logit soft cap |
| Qwen3 | Qwen3-4B | Dense GQA + SwiGLU |
| Qwen3.5 | Qwen3.5-9B | Linear attention + MoE FFN, attn_output_gate per-head interleaving |
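The post names model-manifest.json descriptors but does not show their schema. Purely as a hypothetical illustration, a descriptor for one of the listed models might carry fields like these; every field name below is an assumption, not the actual format:

```json
{
  "family": "Qwen3",
  "model": "Qwen3-4B",
  "architecture": {
    "attention": "gqa",
    "ffn": "swiglu"
  },
  "quantization": { "bits": 4, "group_size": 64 },
  "weights": "model.safetensors"
}
```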

Section 05

Performance Evidence & Results

Decoding Throughput (tokens/s; test env: Apple M5 Max, 128 GB, macOS 26.4.1, batch=1, prefill_step_size=2048):

| Model | MLX Quantization | Prompt Tokens | mlx_lm | AX Engine (speculative) |
| --- | --- | --- | --- | --- |
| Gemma4 E2B | 4-bit+group64+affine | 128 | 197.5 | 467.6 (+136.8%) |
| Gemma4 E2B | 4-bit+group64+affine | 512 | 191.9 | 464.8 (+142.2%) |
| Qwen3-4B | 4-bit+group64 | 128 | 169.6 | 311.5 (+83.7%) |
| Qwen3-4B | 4-bit+group64 | 512 | 169.8 | 289.5 (+70.4%) |
| Qwen3.5-9B | 4-bit+group64+affine | 128 | 92.6 | 168.7 (+82.1%) |
| Qwen3.5-9B | 4-bit+group64+affine | 512 | 94.8 | 87.5 (-7.7%) |

Note: Qwen3.5 uses a rollback-safe branch/recompute path for SSM state; linear-attention speculation requires repeated n-gram evidence and cools down after partial acceptance. The 512-token random-prompt case falls back to greedy decoding because speculation overhead exceeds the draft gains.
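Why a speculation step can lose, as in the Qwen3.5-9B 512-token case, follows from simple expected-value arithmetic: a verification pass over k drafts emits one token per accepted draft plus the one token the pass produces anyway, so speculation only wins when that expectation exceeds the relative cost of the wider pass. A back-of-envelope sketch of this model (not the engine's actual gating logic; `p`, `k`, and `overhead` are modeling assumptions):

```rust
/// Expected tokens emitted per target forward pass when verifying `k`
/// greedy draft tokens, assuming each draft independently matches the
/// target with probability `p`: 1 + p + p^2 + ... + p^k. The leading 1
/// is the token the pass produces even if every draft is rejected.
fn expected_tokens_per_pass(p: f64, k: u32) -> f64 {
    (0..=k).map(|i| p.powi(i as i32)).sum()
}

/// Speculation only pays off when the expected emitted tokens beat the
/// relative cost of the wider verification pass (`overhead` >= 1.0).
fn speculation_wins(p: f64, k: u32, overhead: f64) -> bool {
    expected_tokens_per_pass(p, k) > overhead
}

fn main() {
    // High acceptance: 1 + 0.8 + 0.64 + 0.512 + 0.4096 ≈ 3.36 tokens/pass.
    println!("p=0.8, k=4: {:.2}", expected_tokens_per_pass(0.8, 4));
    // Low acceptance (e.g. unpredictable 512-token prompts): ≈ 1.11
    // tokens/pass, which cannot cover meaningful speculation overhead.
    println!("p=0.1 wins at 1.3x cost: {}", speculation_wins(0.1, 4, 1.3));
}
```

This is consistent with the gating behavior described earlier: when the measured acceptance rate drops, the expected gain falls toward 1 token per pass and disabling speculation becomes the right call.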

Prefill Throughput (tokens/s):

| Model | MLX Quantization | Prompt Tokens | mlx_lm | AX Engine |
| --- | --- | --- | --- | --- |
| Gemma4 E2B | 4-bit+group64+affine | 128 | 2265.8 | 3248.7 (+43.4%) |
| Qwen3-4B | 4-bit+group64 | 128 | 1581.1 | 3077.7 (+94.7%) |

Workload Contract Validation: All tested models (Gemma4 E2B, Qwen3-4B, Qwen3.5-9B) passed with valid TTFT and token counts.

Section 06

Conclusion & Key Innovations

AX Engine's core argument: Decoding strategies on MLX (speculative token prediction, request scheduling, KV state management) deliver significantly higher effective throughput for supported workloads.

Key Innovations:

  1. N-gram self-speculative decoding without a second draft model.
  2. Deterministic request lifecycle and KV block management.
  3. Workload contract validation tool (ensures correctness, determinism, routing identity, regression testing).
  4. Dedicated optimizations for Apple Silicon M4+.

Section 07

Project Information & Requirements

  • License: MIT License.
  • Developer: DEFAI Private Limited.
  • Community: Discord link for technical support.
  • Requirements: macOS on Apple Silicon M4 or newer; Rust 1.85+.
  • Structure: Cargo workspace with multiple crates (engine core, MLX integration, SDK, server, benchmarks, Python extension) for full toolchain support.
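Based on the crates the post names (ax-engine-core, ax-engine-mlx) and the components listed above, the workspace manifest might look roughly like this; all other crate names are illustrative guesses, not the repository's actual layout:

```toml
[workspace]
resolver = "2"
members = [
    "ax-engine-core",   # scheduler, KV manager, request lifecycle (named in the post)
    "ax-engine-mlx",    # handwritten model forward passes over mlx-c (named in the post)
    "ax-engine-sdk",    # illustrative name: client SDK
    "ax-engine-server", # illustrative name: serving layer
    "ax-engine-bench",  # illustrative name: benchmarks
    "ax-engine-py",     # illustrative name: Python extension
]

[workspace.package]
rust-version = "1.85"
license = "MIT"
```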