Zing Forum


AX Engine: An LLM Inference Engine Built Exclusively for Apple Silicon M4+

AX Engine is an LLM inference engine designed specifically for Apple Silicon M4 and newer chips. It uses n-gram self-speculative decoding built on top of MLX to achieve significant performance improvements.

Tags: LLM inference, Apple Silicon, MLX, speculative decoding, n-gram, performance optimization, Rust, local deployment
Published 2026-05-05 09:45 · Recent activity 2026-05-05 10:32 · Estimated read 8 min

Section 01

AX Engine: Overview of a Specialized LLM Inference Engine for Apple Silicon M4+

AX Engine is an LLM inference engine designed specifically for Apple Silicon M4 and newer chips. It leverages n-gram self-speculative decoding technology built on top of MLX to achieve significant performance improvements. Key focus areas include higher effective throughput for supported Transformer models, with innovations in decoding strategies, scheduling, and KV state management.


Section 02

Project Background & Motivation

With the evolution of Apple Silicon, the M4 series provides a strong hardware foundation for local LLM inference. However, existing frameworks such as mlx_lm still leave room for optimization in specific scenarios. AX Engine was developed to offer a more efficient inference solution for supported Transformer model families on Apple Silicon. Its core idea is to build a proprietary scheduling and speculative decoding layer on top of MLX that delivers higher effective throughput than the MLX reference runtime for supported models.


Section 03

Core Technical Architecture & Methods

Execution Layer Design:

  • Uses MLX's official mlx-c C API for tensor operations (no reimplementation of matrix multiplication/attention).
  • N-gram Speculative Decoding: Builds bigram/trigram tables at runtime, predicts up to 4 draft tokens per step. Validates via target model's forward pass, uses EMA acceptance rate gating (τ=0.1, threshold=0.5) to disable speculation if quality drops. No second draft model or model modifications needed.
  • Scheduler & KV Manager: Integrates request lifecycle, batching, memory block recovery, and execution planning in ax-engine-core (deterministic, no async, no framework dependency).
  • Chunked KV Cache: KV grows via slice_update in preallocated buffers; speculation rollback is O(1) (only move sequence length pointer). Flattens lazy evaluation graph after each step to avoid O(N²) depth.
  • Graph Compilation: Enables mlx_enable_compile() at startup for Metal shader reuse across same-shape steps.

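The n-gram speculation and EMA gating described above can be sketched in plain Rust. This is a minimal illustration, not the project's actual code: the type and method names are assumptions, and the bigram table is simplified to "most recently seen successor" per token. Only the τ=0.1 smoothing factor and 0.5 disable threshold come from the article.

```rust
use std::collections::HashMap;

/// Bigram draft table: maps a context token to the token that most
/// recently followed it in the generated sequence (simplified sketch).
struct BigramTable {
    next: HashMap<u32, u32>,
}

impl BigramTable {
    fn new() -> Self {
        Self { next: HashMap::new() }
    }

    /// Record an observed (prev -> tok) transition.
    fn observe(&mut self, prev: u32, tok: u32) {
        self.next.insert(prev, tok);
    }

    /// Draft up to `k` tokens by chaining bigram lookups from `last`.
    fn draft(&self, last: u32, k: usize) -> Vec<u32> {
        let mut out = Vec::new();
        let mut cur = last;
        for _ in 0..k {
            match self.next.get(&cur) {
                Some(&t) => {
                    out.push(t);
                    cur = t;
                }
                None => break,
            }
        }
        out
    }
}

/// EMA gate over the draft acceptance rate: speculation is disabled
/// while the smoothed rate falls below the threshold.
struct EmaGate {
    rate: f64,
    tau: f64,       // smoothing factor, tau = 0.1 per the article
    threshold: f64, // disable speculation below 0.5 per the article
}

impl EmaGate {
    fn new(tau: f64, threshold: f64) -> Self {
        Self { rate: 1.0, tau, threshold }
    }

    /// Fold one verification step's acceptance ratio into the EMA.
    fn update(&mut self, accepted: usize, drafted: usize) {
        if drafted == 0 {
            return;
        }
        let step = accepted as f64 / drafted as f64;
        self.rate = (1.0 - self.tau) * self.rate + self.tau * step;
    }

    fn speculation_enabled(&self) -> bool {
        self.rate >= self.threshold
    }
}

fn main() {
    let mut table = BigramTable::new();
    // Learn transitions from a repetitive token stream: 1 2 3 1 2 3 ...
    let seq = [1u32, 2, 3, 1, 2, 3, 1, 2];
    for w in seq.windows(2) {
        table.observe(w[0], w[1]);
    }
    // Drafting up to 4 tokens from token 3 chains 3 -> 1 -> 2 -> 3 -> 1.
    println!("draft = {:?}", table.draft(3, 4)); // [1, 2, 3, 1]

    let mut gate = EmaGate::new(0.1, 0.5);
    gate.update(4, 4); // all 4 drafts accepted this step
    println!("enabled = {}", gate.speculation_enabled()); // true
}
```

Because drafting is a pure table lookup, misses cost almost nothing; the gate exists so that prompts with little local repetition degrade gracefully back to vanilla decoding.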
Memory Layer Optimization: Preloads model weights to GPU memory via mlx_set_wired_limit to prevent paging; uses dedicated GPU streams to avoid cross-stream sync on default shared streams.
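The chunked KV cache with O(1) rollback from the architecture bullets can be sketched as follows. This is an illustrative stand-in, not the engine's code: a flat `Vec<f32>` plays the role of the preallocated MLX buffer, `append` stands in for `slice_update`, and `len` is the sequence-length pointer that rollback moves.

```rust
/// Minimal sketch of a chunked KV cache with O(1) speculative rollback.
struct ChunkedKvCache {
    buf: Vec<f32>, // preallocated storage, `dim` f32 entries per token
    dim: usize,    // entries written per token
    len: usize,    // committed sequence length (the rollback pointer)
}

impl ChunkedKvCache {
    fn with_capacity(max_tokens: usize, dim: usize) -> Self {
        Self { buf: vec![0.0; max_tokens * dim], dim, len: 0 }
    }

    /// Append one token's KV entries at the current write position
    /// (the analogue of slice_update into a preallocated buffer).
    fn append(&mut self, kv: &[f32]) {
        assert_eq!(kv.len(), self.dim);
        let start = self.len * self.dim;
        self.buf[start..start + self.dim].copy_from_slice(kv);
        self.len += 1;
    }

    /// Roll back rejected draft tokens: O(1), only the pointer moves.
    /// Stale entries past `len` are overwritten by later appends.
    fn rollback(&mut self, accepted_len: usize) {
        assert!(accepted_len <= self.len);
        self.len = accepted_len;
    }

    fn seq_len(&self) -> usize {
        self.len
    }
}

fn main() {
    let mut cache = ChunkedKvCache::with_capacity(8, 2);
    cache.append(&[1.0, 1.0]); // committed token
    cache.append(&[2.0, 2.0]); // draft token 1
    cache.append(&[3.0, 3.0]); // draft token 2
    // Target model accepted only the first draft token: roll back to 2.
    cache.rollback(2);
    println!("seq_len = {}", cache.seq_len()); // 2
}
```

The key property is that rejecting draft tokens never copies or frees memory; only the length pointer moves, which is what keeps mis-speculation cheap.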


Section 04

Supported Model Families

AX Engine supports specific Transformer model families with handwritten forward pass implementations in ax-engine-mlx, using MLX safe tensor format and model-manifest.json descriptors.

| Family  | Model                          | Architecture Features                                                                           |
|---------|--------------------------------|-------------------------------------------------------------------------------------------------|
| Gemma4  | gemma-4-e2b-it, gemma-4-e4b-it | Layer-wise embedding, input gating, sliding window + full attention, KV sharing, logit soft cap |
| Qwen3   | Qwen3-4B                       | Dense GQA + SwiGLU                                                                              |
| Qwen3.5 | Qwen3.5-9B                     | Linear attention + MoE FFN, attn_output_gate per-head interleaving                              |
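Since each family gets its own handwritten forward pass, the engine has to route a model identifier to the right implementation. A hypothetical sketch of that routing (the enum, function name, and prefix rules are all assumptions for illustration, not the project's API):

```rust
/// Hypothetical family routing; variants mirror the table above.
#[derive(Debug, PartialEq)]
enum ModelFamily {
    Gemma4,
    Qwen3,
    Qwen3_5,
}

/// Map a model identifier to its family. The most specific prefix is
/// checked first so "Qwen3.5-..." is not misrouted to the Qwen3 family.
fn family_from_id(model_id: &str) -> Option<ModelFamily> {
    if model_id.starts_with("gemma-4-") {
        Some(ModelFamily::Gemma4)
    } else if model_id.starts_with("Qwen3.5-") {
        Some(ModelFamily::Qwen3_5)
    } else if model_id.starts_with("Qwen3-") {
        Some(ModelFamily::Qwen3)
    } else {
        None // unsupported model: no handwritten forward pass available
    }
}

fn main() {
    println!("{:?}", family_from_id("gemma-4-e2b-it")); // Some(Gemma4)
    println!("{:?}", family_from_id("Qwen3.5-9B"));     // Some(Qwen3_5)
    println!("{:?}", family_from_id("Qwen3-4B"));       // Some(Qwen3)
}
```

In the real project this mapping presumably comes from the model-manifest.json descriptor rather than string prefixes; the point is only that routing must be exact, which is also what the "routing identity" check in the workload contract validation appears to verify.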

Section 05

Performance Evidence & Results

Decoding Throughput (tokens/s) (test environment: Apple M5 Max, 128 GB, macOS 26.4.1, batch=1, prefill_step_size=2048):

| Model      | MLX Quantization       | Prompt Tokens | mlx_lm | ax engine (speculative) |
|------------|------------------------|---------------|--------|-------------------------|
| Gemma4 E2B | 4-bit, group64, affine | 128           | 197.5  | 467.6 (+136.8%)         |
| Gemma4 E2B | 4-bit, group64, affine | 512           | 191.9  | 464.8 (+142.2%)         |
| Qwen3-4B   | 4-bit, group64         | 128           | 169.6  | 311.5 (+83.7%)          |
| Qwen3-4B   | 4-bit, group64         | 512           | 169.8  | 289.5 (+70.4%)          |
| Qwen3.5-9B | 4-bit, group64, affine | 128           | 92.6   | 168.7 (+82.1%)          |
| Qwen3.5-9B | 4-bit, group64, affine | 512           | 94.8   | 87.5 (-7.7%)            |

Note: Qwen3.5 uses a rollback-safe branch/recompute path for its SSM state; speculation under linear attention requires repeated n-gram evidence and cools down after a partial acceptance. The 512-token random-prompt case falls back to greedy decoding because the speculation overhead exceeds the draft gains.
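The percentage columns above are simply the ratio of the two throughput columns. A quick arithmetic check of the first and last decoding rows:

```rust
/// Relative throughput gain as reported in the tables:
/// gain% = (ax_engine / mlx_lm - 1) * 100.
fn gain_percent(baseline: f64, engine: f64) -> f64 {
    (engine / baseline - 1.0) * 100.0
}

fn main() {
    // Gemma4 E2B, 128 prompt tokens: 197.5 -> 467.6 tokens/s.
    println!("{:.1}%", gain_percent(197.5, 467.6)); // 136.8%
    // Qwen3.5-9B, 512 prompt tokens: 94.8 -> 87.5 tokens/s (regression).
    println!("{:.1}%", gain_percent(94.8, 87.5)); // -7.7%
}
```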

Prefill Throughput (tokens/s):

| Model      | MLX Quantization       | Prompt Tokens | mlx_lm | ax engine       |
|------------|------------------------|---------------|--------|-----------------|
| Gemma4 E2B | 4-bit, group64, affine | 128           | 2265.8 | 3248.7 (+43.4%) |
| Qwen3-4B   | 4-bit, group64         | 128           | 1581.1 | 3077.7 (+94.7%) |

Workload Contract Validation: All tested models (Gemma4 E2B, Qwen3-4B, Qwen3.5-9B) passed with valid TTFT and token counts.


Section 06

Conclusion & Key Innovations

AX Engine's core argument: Decoding strategies on MLX (speculative token prediction, request scheduling, KV state management) deliver significantly higher effective throughput for supported workloads.

Key Innovations:

  1. N-gram self-speculative decoding without a second draft model.
  2. Deterministic request lifecycle and KV block management.
  3. Workload contract validation tool (ensures correctness, determinism, routing identity, regression testing).
  4. Dedicated optimizations for Apple Silicon M4+.

Section 07

Project Information & Requirements

  • License: MIT License.
  • Developer: DEFAI Private Limited.
  • Community: Discord link for technical support.
  • Requirements: macOS on Apple Silicon M4+; Rust 1.85+.
  • Structure: Cargo workspace with multiple crates (engine core, MLX integration, SDK, server, benchmarks, Python extension) for full toolchain support.