Zing Forum

AX Engine: An LLM Inference Engine Built for Apple Silicon M4+

AX Engine is an LLM inference engine designed for Apple Silicon M4 and newer chips. It uses n-gram self-speculative decoding on top of MLX to achieve significant performance improvements.

Tags: LLM inference · Apple Silicon · MLX · speculative decoding · n-gram · performance optimization · Rust · local deployment
Published 2026/05/05 09:45 · Last activity 2026/05/05 10:32 · Estimated reading time: 8 minutes

Section 01

AX Engine: Overview of a Specialized LLM Inference Engine for Apple Silicon M4+

AX Engine is an LLM inference engine designed specifically for Apple Silicon M4 and newer chips. It leverages n-gram self-speculative decoding built on top of MLX to achieve significant performance improvements. Key focus areas include higher effective throughput for supported Transformer models, with innovations in decoding strategies, scheduling, and KV state management.

Section 02

Project Background & Motivation

With the evolution of Apple Silicon chips, the M4 series provides a strong hardware foundation for local LLM inference. However, existing frameworks such as mlx_lm still leave room for optimization in specific scenarios. AX Engine was developed to offer a more efficient inference solution for supported Transformer model families on Apple Silicon. Its core idea is to build a proprietary scheduling and speculative decoding layer on top of MLX that delivers higher effective throughput than the MLX reference runtime for supported models.

Section 03

Core Technical Architecture & Methods

Execution Layer Design:

  • Uses MLX's official mlx-c C API for tensor operations (no reimplementation of matrix multiplication or attention).
  • N-gram Speculative Decoding: Builds bigram/trigram tables at runtime and predicts up to 4 draft tokens per step. Drafts are validated via the target model's forward pass, with EMA acceptance-rate gating (τ=0.1, threshold=0.5) disabling speculation if quality drops. No second draft model or model modifications are needed.
  • Scheduler & KV Manager: Integrates request lifecycle, batching, memory block recovery, and execution planning in ax-engine-core (deterministic, no async, no framework dependency).
  • Chunked KV Cache: KV state grows via slice_update into preallocated buffers; speculation rollback is O(1) (only the sequence-length pointer moves). The lazy evaluation graph is flattened after each step to avoid O(N²) graph depth.
  • Graph Compilation: Enables mlx_enable_compile() at startup so Metal shaders are reused across same-shape steps.
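The n-gram speculation and EMA gating described above can be sketched as follows. This is an illustrative Rust model, not the engine's implementation: the struct and method names are hypothetical, and only the bigram path, the 4-token draft limit, and the τ=0.1 / 0.5 gate parameters come from the post.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of an n-gram self-speculative drafter.
/// Bigram table: last token id -> most recently observed successor.
struct NgramDrafter {
    bigrams: HashMap<u32, u32>,
    accept_ema: f64, // EMA of the per-step draft acceptance rate
    tau: f64,        // EMA smoothing factor (τ=0.1 in the post)
    threshold: f64,  // gate: disable speculation below this (0.5 in the post)
}

impl NgramDrafter {
    fn new() -> Self {
        Self { bigrams: HashMap::new(), accept_ema: 1.0, tau: 0.1, threshold: 0.5 }
    }

    /// Record a newly generated token pair into the bigram table.
    fn observe(&mut self, prev: u32, next: u32) {
        self.bigrams.insert(prev, next);
    }

    /// Propose up to `max_draft` tokens by walking the bigram table.
    fn draft(&self, last: u32, max_draft: usize) -> Vec<u32> {
        if self.accept_ema < self.threshold {
            return Vec::new(); // gated off: recent speculation quality dropped
        }
        let mut out = Vec::new();
        let mut cur = last;
        for _ in 0..max_draft {
            match self.bigrams.get(&cur) {
                Some(&next) => { out.push(next); cur = next; }
                None => break,
            }
        }
        out
    }

    /// After the target model's forward pass verified the drafts,
    /// fold this step's acceptance rate into the EMA.
    fn update(&mut self, accepted: usize, proposed: usize) {
        if proposed == 0 { return; }
        let rate = accepted as f64 / proposed as f64;
        self.accept_ema = (1.0 - self.tau) * self.accept_ema + self.tau * rate;
    }
}

fn main() {
    let mut drafter = NgramDrafter::new();
    // Seed the table from previously generated token ids.
    drafter.observe(10, 11);
    drafter.observe(11, 12);
    println!("draft after token 10: {:?}", drafter.draft(10, 4));
    // Repeated full rejections push the EMA below 0.5 and gate speculation off.
    for _ in 0..7 { drafter.update(0, 1); }
    println!("gated draft: {:?}", drafter.draft(10, 4));
}
```

The gate is cheap to evaluate and self-recovering: once accepted drafts appear again, `update` pulls the EMA back above the threshold and speculation resumes.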

Memory Layer Optimization: Preloads model weights into GPU memory via mlx_set_wired_limit to prevent paging, and uses dedicated GPU streams to avoid cross-stream synchronization on the default shared stream.
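The chunked KV cache with O(1) rollback can be illustrated with a minimal sketch. The names and flat-buffer layout here are hypothetical; only the preallocated-buffer and rollback-by-pointer behavior come from the post (where the in-place write is MLX's slice_update rather than a plain slice copy).

```rust
/// Hypothetical per-layer KV key buffer with O(1) speculative rollback.
struct ChunkedKvCache {
    keys: Vec<f32>,   // flat [capacity * head_dim] buffer, preallocated once
    head_dim: usize,
    capacity: usize,  // maximum sequence length
    len: usize,       // current sequence length: the only rollback state
}

impl ChunkedKvCache {
    fn new(capacity: usize, head_dim: usize) -> Self {
        Self { keys: vec![0.0; capacity * head_dim], head_dim, capacity, len: 0 }
    }

    /// Write one position's key vector in place (stands in for slice_update).
    fn append(&mut self, k: &[f32]) {
        assert!(self.len < self.capacity && k.len() == self.head_dim);
        let off = self.len * self.head_dim;
        self.keys[off..off + self.head_dim].copy_from_slice(k);
        self.len += 1;
    }

    /// Reject `n` draft positions: only the length pointer moves,
    /// so the cost is O(1) regardless of how much was written.
    fn rollback(&mut self, n: usize) {
        self.len -= n.min(self.len);
    }
}

fn main() {
    let mut cache = ChunkedKvCache::new(8, 2);
    cache.append(&[1.0, 1.0]); // committed token
    cache.append(&[2.0, 2.0]); // draft token 1
    cache.append(&[3.0, 3.0]); // draft token 2
    cache.rollback(2);         // both drafts rejected: pointer moves back
    cache.append(&[9.0, 9.0]); // next step simply overwrites the stale slot
    println!("len = {}, slot 1 = {:?}", cache.len, &cache.keys[2..4]);
}
```

Because rejected positions are never freed, only overwritten, rollback involves no allocation or copying, which is what keeps mis-speculation cheap.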

Section 04

Supported Model Families

AX Engine supports specific Transformer model families with handwritten forward-pass implementations in ax-engine-mlx, using the MLX safetensors weight format and model-manifest.json descriptors.

| Family | Models | Architecture Features |
| --- | --- | --- |
| Gemma4 | gemma-4-e2b-it, gemma-4-e4b-it | Layer-wise embedding, input gating, sliding window + full attention, KV sharing, logit soft cap |
| Qwen3 | Qwen3-4B | Dense GQA + SwiGLU |
| Qwen3.5 | Qwen3.5-9B | Linear attention + MoE FFN, attn_output_gate per-head interleaving |
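The post names model-manifest.json descriptors but does not show their schema. Purely as a hypothetical illustration, a descriptor for one of the listed models might carry fields like these; every field name below is an assumption, not the actual format:

```json
{
  "family": "Qwen3",
  "model": "Qwen3-4B",
  "architecture": {
    "attention": "gqa",
    "ffn": "swiglu"
  },
  "quantization": { "bits": 4, "group_size": 64 },
  "weights": "model.safetensors"
}
```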

Section 05

Performance Evidence & Results

Decoding Throughput (tokens/s; test env: Apple M5 Max, 128 GB, macOS 26.4.1, batch=1, prefill_step_size=2048):

| Model | MLX Quantization | Prompt Tokens | mlx_lm | AX Engine (speculative) |
| --- | --- | --- | --- | --- |
| Gemma4 E2B | 4-bit+group64+affine | 128 | 197.5 | 467.6 (+136.8%) |
| Gemma4 E2B | 4-bit+group64+affine | 512 | 191.9 | 464.8 (+142.2%) |
| Qwen3-4B | 4-bit+group64 | 128 | 169.6 | 311.5 (+83.7%) |
| Qwen3-4B | 4-bit+group64 | 512 | 169.8 | 289.5 (+70.4%) |
| Qwen3.5-9B | 4-bit+group64+affine | 128 | 92.6 | 168.7 (+82.1%) |
| Qwen3.5-9B | 4-bit+group64+affine | 512 | 94.8 | 87.5 (-7.7%) |

Note: Qwen3.5 uses a rollback-safe branch/recompute path for SSM state; linear-attention speculation requires repeated n-gram evidence and cools down after partial acceptance. The 512-token random-prompt case falls back to greedy decoding because speculation overhead exceeds the draft gains.
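Why a speculation step can lose, as in the Qwen3.5-9B 512-token case, follows from simple expected-value arithmetic: a verification pass over k drafts emits one token per accepted draft plus the one token the pass produces anyway, so speculation only wins when that expectation exceeds the relative cost of the wider pass. A back-of-envelope sketch of this model (not the engine's actual gating logic; `p`, `k`, and `overhead` are modeling assumptions):

```rust
/// Expected tokens emitted per target forward pass when verifying `k`
/// greedy draft tokens, assuming each draft independently matches the
/// target with probability `p`: 1 + p + p^2 + ... + p^k. The leading 1
/// is the token the pass produces even if every draft is rejected.
fn expected_tokens_per_pass(p: f64, k: u32) -> f64 {
    (0..=k).map(|i| p.powi(i as i32)).sum()
}

/// Speculation only pays off when the expected emitted tokens beat the
/// relative cost of the wider verification pass (`overhead` >= 1.0).
fn speculation_wins(p: f64, k: u32, overhead: f64) -> bool {
    expected_tokens_per_pass(p, k) > overhead
}

fn main() {
    // High acceptance: 1 + 0.8 + 0.64 + 0.512 + 0.4096 ≈ 3.36 tokens/pass.
    println!("p=0.8, k=4: {:.2}", expected_tokens_per_pass(0.8, 4));
    // Low acceptance (e.g. unpredictable 512-token prompts): ≈ 1.11
    // tokens/pass, which cannot cover meaningful speculation overhead.
    println!("p=0.1 wins at 1.3x cost: {}", speculation_wins(0.1, 4, 1.3));
}
```

This is consistent with the gating behavior described earlier: when the measured acceptance rate drops, the expected gain falls toward 1 token per pass and disabling speculation becomes the right call.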

Prefill Throughput (tokens/s):

| Model | MLX Quantization | Prompt Tokens | mlx_lm | AX Engine |
| --- | --- | --- | --- | --- |
| Gemma4 E2B | 4-bit+group64+affine | 128 | 2265.8 | 3248.7 (+43.4%) |
| Qwen3-4B | 4-bit+group64 | 128 | 1581.1 | 3077.7 (+94.7%) |

Workload Contract Validation: All tested models (Gemma4 E2B, Qwen3-4B, Qwen3.5-9B) passed with valid TTFT and token counts.

Section 06

Conclusion & Key Innovations

AX Engine's core argument: Decoding strategies on MLX (speculative token prediction, request scheduling, KV state management) deliver significantly higher effective throughput for supported workloads.

Key Innovations:

  1. N-gram self-speculative decoding without a second draft model.
  2. Deterministic request lifecycle and KV block management.
  3. Workload contract validation tool (ensures correctness, determinism, routing identity, regression testing).
  4. Dedicated optimizations for Apple Silicon M4+.

Section 07

Project Information & Requirements

  • License: MIT License.
  • Developer: DEFAI Private Limited.
  • Community: Discord link for technical support.
  • Requirements: macOS on Apple Silicon M4 or newer; Rust 1.85+.
  • Structure: Cargo workspace with multiple crates (engine core, MLX integration, SDK, server, benchmarks, Python extension) for full toolchain support.
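Based on the crates the post names (ax-engine-core, ax-engine-mlx) and the components listed above, the workspace manifest might look roughly like this; all other crate names are illustrative guesses, not the repository's actual layout:

```toml
[workspace]
resolver = "2"
members = [
    "ax-engine-core",   # scheduler, KV manager, request lifecycle (named in the post)
    "ax-engine-mlx",    # handwritten model forward passes over mlx-c (named in the post)
    "ax-engine-sdk",    # illustrative name: client SDK
    "ax-engine-server", # illustrative name: serving layer
    "ax-engine-bench",  # illustrative name: benchmarks
    "ax-engine-py",     # illustrative name: Python extension
]

[workspace.package]
rust-version = "1.85"
license = "MIT"
```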