Zing Forum


AX Engine: An LLM Inference Engine Built Exclusively for Apple Silicon M4+

AX Engine is an LLM inference engine designed specifically for Apple Silicon M4 and newer chips. It uses n-gram self-speculative decoding built on top of MLX to achieve significant performance improvements.

Tags: LLM inference, Apple Silicon, MLX, speculative decoding, n-gram, performance optimization, Rust, local deployment
Published 2026-05-05 09:45 · Recent activity 2026-05-05 10:32 · Estimated read 8 min

Section 01

AX Engine: Overview of a Specialized LLM Inference Engine for Apple Silicon M4+

AX Engine is an LLM inference engine designed specifically for Apple Silicon M4 and newer chips. It leverages n-gram self-speculative decoding technology built on top of MLX to achieve significant performance improvements. Key focus areas include higher effective throughput for supported Transformer models, with innovations in decoding strategies, scheduling, and KV state management.


Section 02

Project Background & Motivation

With the evolution of Apple Silicon, the M4 series provides a strong hardware foundation for local LLM inference. However, existing frameworks such as mlx_lm still leave room for optimization in specific scenarios. AX Engine was developed to offer a more efficient inference solution for supported Transformer model families on Apple Silicon. Its core idea is to build a proprietary scheduling and speculative decoding layer on top of MLX that delivers higher effective throughput than the MLX reference runtime for supported models.


Section 03

Core Technical Architecture & Methods

Execution Layer Design:

  • Uses MLX's official mlx-c C API for tensor operations (no reimplementation of matrix multiplication/attention).
  • N-gram Speculative Decoding: Builds bigram/trigram tables at runtime, predicts up to 4 draft tokens per step. Validates via target model's forward pass, uses EMA acceptance rate gating (τ=0.1, threshold=0.5) to disable speculation if quality drops. No second draft model or model modifications needed.
  • Scheduler & KV Manager: Integrates request lifecycle, batching, memory block recovery, and execution planning in ax-engine-core (deterministic, no async, no framework dependency).
  • Chunked KV Cache: KV grows via slice_update in preallocated buffers; speculation rollback is O(1) (only move sequence length pointer). Flattens lazy evaluation graph after each step to avoid O(N²) depth.
  • Graph Compilation: Enables mlx_enable_compile() at startup for Metal shader reuse across same-shape steps.

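The n-gram speculation and EMA gating described above can be sketched in plain Rust. This is a minimal illustration, not the project's actual code: the type and method names are assumptions, and the bigram table is simplified to "most recently seen successor" per token. Only the τ=0.1 smoothing factor and 0.5 disable threshold come from the article.

```rust
use std::collections::HashMap;

/// Bigram draft table: maps a context token to the token that most
/// recently followed it in the generated sequence (simplified sketch).
struct BigramTable {
    next: HashMap<u32, u32>,
}

impl BigramTable {
    fn new() -> Self {
        Self { next: HashMap::new() }
    }

    /// Record an observed (prev -> tok) transition.
    fn observe(&mut self, prev: u32, tok: u32) {
        self.next.insert(prev, tok);
    }

    /// Draft up to `k` tokens by chaining bigram lookups from `last`.
    fn draft(&self, last: u32, k: usize) -> Vec<u32> {
        let mut out = Vec::new();
        let mut cur = last;
        for _ in 0..k {
            match self.next.get(&cur) {
                Some(&t) => {
                    out.push(t);
                    cur = t;
                }
                None => break,
            }
        }
        out
    }
}

/// EMA gate over the draft acceptance rate: speculation is disabled
/// while the smoothed rate falls below the threshold.
struct EmaGate {
    rate: f64,
    tau: f64,       // smoothing factor, tau = 0.1 per the article
    threshold: f64, // disable speculation below 0.5 per the article
}

impl EmaGate {
    fn new(tau: f64, threshold: f64) -> Self {
        Self { rate: 1.0, tau, threshold }
    }

    /// Fold one verification step's acceptance ratio into the EMA.
    fn update(&mut self, accepted: usize, drafted: usize) {
        if drafted == 0 {
            return;
        }
        let step = accepted as f64 / drafted as f64;
        self.rate = (1.0 - self.tau) * self.rate + self.tau * step;
    }

    fn speculation_enabled(&self) -> bool {
        self.rate >= self.threshold
    }
}

fn main() {
    let mut table = BigramTable::new();
    // Learn transitions from a repetitive token stream: 1 2 3 1 2 3 ...
    let seq = [1u32, 2, 3, 1, 2, 3, 1, 2];
    for w in seq.windows(2) {
        table.observe(w[0], w[1]);
    }
    // Drafting up to 4 tokens from token 3 chains 3 -> 1 -> 2 -> 3 -> 1.
    println!("draft = {:?}", table.draft(3, 4)); // [1, 2, 3, 1]

    let mut gate = EmaGate::new(0.1, 0.5);
    gate.update(4, 4); // all 4 drafts accepted this step
    println!("enabled = {}", gate.speculation_enabled()); // true
}
```

Because drafting is a pure table lookup, misses cost almost nothing; the gate exists so that prompts with little local repetition degrade gracefully back to vanilla decoding.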
Memory Layer Optimization: Preloads model weights to GPU memory via mlx_set_wired_limit to prevent paging; uses dedicated GPU streams to avoid cross-stream sync on default shared streams.
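The chunked KV cache with O(1) rollback from the architecture bullets can be sketched as follows. This is an illustrative stand-in, not the engine's code: a flat `Vec<f32>` plays the role of the preallocated MLX buffer, `append` stands in for `slice_update`, and `len` is the sequence-length pointer that rollback moves.

```rust
/// Minimal sketch of a chunked KV cache with O(1) speculative rollback.
struct ChunkedKvCache {
    buf: Vec<f32>, // preallocated storage, `dim` f32 entries per token
    dim: usize,    // entries written per token
    len: usize,    // committed sequence length (the rollback pointer)
}

impl ChunkedKvCache {
    fn with_capacity(max_tokens: usize, dim: usize) -> Self {
        Self { buf: vec![0.0; max_tokens * dim], dim, len: 0 }
    }

    /// Append one token's KV entries at the current write position
    /// (the analogue of slice_update into a preallocated buffer).
    fn append(&mut self, kv: &[f32]) {
        assert_eq!(kv.len(), self.dim);
        let start = self.len * self.dim;
        self.buf[start..start + self.dim].copy_from_slice(kv);
        self.len += 1;
    }

    /// Roll back rejected draft tokens: O(1), only the pointer moves.
    /// Stale entries past `len` are overwritten by later appends.
    fn rollback(&mut self, accepted_len: usize) {
        assert!(accepted_len <= self.len);
        self.len = accepted_len;
    }

    fn seq_len(&self) -> usize {
        self.len
    }
}

fn main() {
    let mut cache = ChunkedKvCache::with_capacity(8, 2);
    cache.append(&[1.0, 1.0]); // committed token
    cache.append(&[2.0, 2.0]); // draft token 1
    cache.append(&[3.0, 3.0]); // draft token 2
    // Target model accepted only the first draft token: roll back to 2.
    cache.rollback(2);
    println!("seq_len = {}", cache.seq_len()); // 2
}
```

The key property is that rejecting draft tokens never copies or frees memory; only the length pointer moves, which is what keeps mis-speculation cheap.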


Section 04

Supported Model Families

AX Engine supports specific Transformer model families with handwritten forward pass implementations in ax-engine-mlx, using MLX safe tensor format and model-manifest.json descriptors.

| Family  | Model                          | Architecture Features                                                                           |
|---------|--------------------------------|-------------------------------------------------------------------------------------------------|
| Gemma4  | gemma-4-e2b-it, gemma-4-e4b-it | Layer-wise embedding, input gating, sliding window + full attention, KV sharing, logit soft cap |
| Qwen3   | Qwen3-4B                       | Dense GQA + SwiGLU                                                                              |
| Qwen3.5 | Qwen3.5-9B                     | Linear attention + MoE FFN, attn_output_gate per-head interleaving                              |
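Since each family gets its own handwritten forward pass, the engine has to route a model identifier to the right implementation. A hypothetical sketch of that routing (the enum, function name, and prefix rules are all assumptions for illustration, not the project's API):

```rust
/// Hypothetical family routing; variants mirror the table above.
#[derive(Debug, PartialEq)]
enum ModelFamily {
    Gemma4,
    Qwen3,
    Qwen3_5,
}

/// Map a model identifier to its family. The most specific prefix is
/// checked first so "Qwen3.5-..." is not misrouted to the Qwen3 family.
fn family_from_id(model_id: &str) -> Option<ModelFamily> {
    if model_id.starts_with("gemma-4-") {
        Some(ModelFamily::Gemma4)
    } else if model_id.starts_with("Qwen3.5-") {
        Some(ModelFamily::Qwen3_5)
    } else if model_id.starts_with("Qwen3-") {
        Some(ModelFamily::Qwen3)
    } else {
        None // unsupported model: no handwritten forward pass available
    }
}

fn main() {
    println!("{:?}", family_from_id("gemma-4-e2b-it")); // Some(Gemma4)
    println!("{:?}", family_from_id("Qwen3.5-9B"));     // Some(Qwen3_5)
    println!("{:?}", family_from_id("Qwen3-4B"));       // Some(Qwen3)
}
```

In the real project this mapping presumably comes from the model-manifest.json descriptor rather than string prefixes; the point is only that routing must be exact, which is also what the "routing identity" check in the workload contract validation appears to verify.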

Section 05

Performance Evidence & Results

Decoding Throughput (tokens/s) (test environment: Apple M5 Max, 128 GB, macOS 26.4.1, batch=1, prefill_step_size=2048):

| Model      | MLX Quantization       | Prompt Tokens | mlx_lm | ax engine (speculative) |
|------------|------------------------|---------------|--------|-------------------------|
| Gemma4 E2B | 4-bit, group64, affine | 128           | 197.5  | 467.6 (+136.8%)         |
| Gemma4 E2B | 4-bit, group64, affine | 512           | 191.9  | 464.8 (+142.2%)         |
| Qwen3-4B   | 4-bit, group64         | 128           | 169.6  | 311.5 (+83.7%)          |
| Qwen3-4B   | 4-bit, group64         | 512           | 169.8  | 289.5 (+70.4%)          |
| Qwen3.5-9B | 4-bit, group64, affine | 128           | 92.6   | 168.7 (+82.1%)          |
| Qwen3.5-9B | 4-bit, group64, affine | 512           | 94.8   | 87.5 (-7.7%)            |

Note: Qwen3.5 uses a rollback-safe branch/recompute path for its SSM state; speculation under linear attention requires repeated n-gram evidence and cools down after a partial acceptance. The 512-token random-prompt case falls back to greedy decoding because the speculation overhead exceeds the draft gains.
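The percentage columns above are simply the ratio of the two throughput columns. A quick arithmetic check of the first and last decoding rows:

```rust
/// Relative throughput gain as reported in the tables:
/// gain% = (ax_engine / mlx_lm - 1) * 100.
fn gain_percent(baseline: f64, engine: f64) -> f64 {
    (engine / baseline - 1.0) * 100.0
}

fn main() {
    // Gemma4 E2B, 128 prompt tokens: 197.5 -> 467.6 tokens/s.
    println!("{:.1}%", gain_percent(197.5, 467.6)); // 136.8%
    // Qwen3.5-9B, 512 prompt tokens: 94.8 -> 87.5 tokens/s (regression).
    println!("{:.1}%", gain_percent(94.8, 87.5)); // -7.7%
}
```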

Prefill Throughput (tokens/s):

| Model      | MLX Quantization       | Prompt Tokens | mlx_lm | ax engine       |
|------------|------------------------|---------------|--------|-----------------|
| Gemma4 E2B | 4-bit, group64, affine | 128           | 2265.8 | 3248.7 (+43.4%) |
| Qwen3-4B   | 4-bit, group64         | 128           | 1581.1 | 3077.7 (+94.7%) |

Workload Contract Validation: All tested models (Gemma4 E2B, Qwen3-4B, Qwen3.5-9B) passed with valid TTFT and token counts.


Section 06

Conclusion & Key Innovations

AX Engine's core argument: Decoding strategies on MLX (speculative token prediction, request scheduling, KV state management) deliver significantly higher effective throughput for supported workloads.

Key Innovations:

  1. N-gram self-speculative decoding without a second draft model.
  2. Deterministic request lifecycle and KV block management.
  3. Workload contract validation tool (ensures correctness, determinism, routing identity, regression testing).
  4. Dedicated optimizations for Apple Silicon M4+.

Section 07

Project Information & Requirements

  • License: MIT License.
  • Developer: DEFAI Private Limited.
  • Community: Discord link for technical support.
  • Requirements: macOS on Apple Silicon M4+; Rust 1.85+.
  • Structure: Cargo workspace with multiple crates (engine core, MLX integration, SDK, server, benchmarks, Python extension) for full toolchain support.