# AX Engine: An LLM Inference Engine Built Exclusively for Apple Silicon M4+

> AX Engine is an LLM inference engine designed specifically for Apple Silicon M4 and newer chips. Built on MLX, it uses n-gram self-speculative decoding to achieve significant performance improvements.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-05T01:45:23.000Z
- Last activity: 2026-05-05T02:32:15.011Z
- Heat: 150.2
- Keywords: LLM inference, Apple Silicon, MLX, speculative decoding, n-gram, performance optimization, Rust, local deployment
- Page URL: https://www.zingnex.cn/en/forum/thread/ax-engine-apple-silicon-m4-llm
- Canonical: https://www.zingnex.cn/forum/thread/ax-engine-apple-silicon-m4-llm
- Markdown source: floors_fallback

---

## AX Engine: Overview of a Specialized LLM Inference Engine for Apple Silicon M4+

AX Engine is an LLM inference engine designed specifically for Apple Silicon M4 and newer chips. It leverages n-gram self-speculative decoding built on top of MLX to achieve significant performance improvements. Key focus areas are higher effective throughput for supported Transformer models, with innovations in decoding strategies, scheduling, and KV state management.

## Project Background & Motivation

With the evolution of Apple Silicon, the M4 series provides a strong hardware foundation for local LLM inference. However, existing frameworks such as mlx_lm still leave room for optimization in specific scenarios. AX Engine was developed to offer a more efficient inference solution for supported Transformer model families on Apple Silicon. Its core idea is to build a proprietary scheduling and speculative decoding layer on top of MLX that delivers higher effective throughput than the MLX reference runtime for supported models.

## Core Technical Architecture & Methods

**Execution Layer Design**:
- Uses MLX's official `mlx-c` C API for tensor operations (no reimplementation of matrix multiplication/attention).
- **N-gram Speculative Decoding**: Builds bigram/trigram tables at runtime and predicts up to 4 draft tokens per step. Drafts are validated by the target model's forward pass, and an EMA acceptance-rate gate (τ = 0.1, threshold = 0.5) disables speculation if quality drops. No second draft model or model modifications are needed (see the sketch after this list).
- **Scheduler & KV Manager**: Integrates request lifecycle, batching, memory block recovery, and execution planning in `ax-engine-core` (deterministic, no async, no framework dependency).
- **Chunked KV Cache**: KV grows via `slice_update` in preallocated buffers; speculation rollback is O(1), since only the sequence-length pointer moves (sketched after the memory-layer note below). The lazy evaluation graph is flattened after each step to avoid O(N²) graph depth.
- **Graph Compilation**: Enables `mlx_enable_compile()` at startup for Metal shader reuse across same-shape steps.
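
The decoding loop described above can be illustrated with a minimal Rust sketch of runtime n-gram drafting plus EMA acceptance gating. The constants mirror the values quoted in the list (up to 4 draft tokens, τ = 0.1, gate threshold 0.5); all type and method names are hypothetical and are not the actual `ax-engine-core` API.

```rust
use std::collections::HashMap;

// Constants taken from the post; everything else here is illustrative.
const MAX_DRAFT: usize = 4;
const EMA_TAU: f32 = 0.1;
const GATE_THRESHOLD: f32 = 0.5;

struct NgramSpeculator {
    bigrams: HashMap<u32, u32>,         // last token -> most recent successor
    trigrams: HashMap<(u32, u32), u32>, // last two tokens -> successor
    ema_accept: f32,                    // EMA of the draft acceptance rate
}

impl NgramSpeculator {
    fn new() -> Self {
        Self { bigrams: HashMap::new(), trigrams: HashMap::new(), ema_accept: 1.0 }
    }

    /// Update the n-gram tables as tokens are committed to the sequence.
    fn observe(&mut self, history: &[u32]) {
        if let [.., a, b] = history {
            self.bigrams.insert(*a, *b);
        }
        if let [.., a, b, c] = history {
            self.trigrams.insert((*a, *b), *c);
        }
    }

    /// Propose up to MAX_DRAFT tokens by walking the tables, trigram first.
    /// Returns an empty draft when the gate is closed (plain decoding).
    fn draft(&self, history: &[u32]) -> Vec<u32> {
        if self.ema_accept < GATE_THRESHOLD {
            return Vec::new(); // speculation disabled until quality recovers
        }
        let mut ctx: Vec<u32> = history.to_vec();
        let mut out = Vec::new();
        while out.len() < MAX_DRAFT {
            let next = match ctx.as_slice() {
                [.., a, b] if self.trigrams.contains_key(&(*a, *b)) => self.trigrams[&(*a, *b)],
                [.., b] if self.bigrams.contains_key(b) => self.bigrams[b],
                _ => break, // no evidence for this context
            };
            out.push(next);
            ctx.push(next);
        }
        out
    }

    /// After the target model verifies the draft, fold the observed
    /// acceptance rate into the EMA: ema <- (1 - tau) * ema + tau * rate.
    fn record(&mut self, accepted: usize, drafted: usize) {
        if drafted > 0 {
            let rate = accepted as f32 / drafted as f32;
            self.ema_accept = (1.0 - EMA_TAU) * self.ema_accept + EMA_TAU * rate;
        }
    }
}
```

The gate is the safety valve: once the acceptance-rate EMA drops below 0.5, `draft` returns an empty vector and the engine behaves like a plain greedy decoder until quality recovers.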

**Memory Layer Optimization**: Preloads model weights into GPU memory via `mlx_set_wired_limit` to prevent paging, and uses dedicated GPU streams to avoid cross-stream synchronization on the default shared stream.
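
The O(1) rollback claimed for the chunked KV cache follows directly from its layout. Below is a minimal sketch, assuming one preallocated buffer per layer and a single logical sequence-length pointer; the struct and method names are invented for illustration.

```rust
/// Minimal sketch of the chunked-KV-cache idea: the buffer is preallocated,
/// new entries are written at the sequence-length pointer, and rolling back
/// rejected draft tokens is O(1). Not the real engine's data structure.
struct ChunkedKvCache {
    capacity: usize, // preallocated entries per layer
    seq_len: usize,  // logical length; entries past it are stale
}

impl ChunkedKvCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, seq_len: 0 }
    }

    /// Reserve room for `n` new positions (the verified token plus drafts).
    /// In the real engine this is where `slice_update` would write K/V into
    /// the preallocated buffer at offset `seq_len`.
    fn append(&mut self, n: usize) -> Result<std::ops::Range<usize>, &'static str> {
        if self.seq_len + n > self.capacity {
            return Err("KV capacity exceeded");
        }
        let range = self.seq_len..self.seq_len + n;
        self.seq_len += n;
        Ok(range)
    }

    /// O(1) rollback after partial draft acceptance: nothing is copied or
    /// freed, only the sequence-length pointer moves back.
    fn rollback(&mut self, rejected: usize) {
        self.seq_len -= rejected.min(self.seq_len);
    }
}
```

Rejected draft entries are simply left beyond the pointer and overwritten by the next `append`, so a failed speculation step costs no copies or frees.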

## Supported Model Families

AX Engine supports specific Transformer model families with handwritten forward-pass implementations in `ax-engine-mlx`, using the MLX safetensors format and `model-manifest.json` descriptors.

| Family | Model | Architecture Features |
|--------|-------|-----------------------|
| Gemma4 | gemma-4-e2b-it, gemma-4-e4b-it | Layer-wise embedding, input gating, sliding window + full attention, KV sharing, logit soft cap |
| Qwen3 | Qwen3-4B | Dense GQA + SwiGLU |
| Qwen3.5 | Qwen3.5-9B | Linear attention + MoE FFN, attn_output_gate per-head interleaving |
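
For illustration only, here is one way such a descriptor could be loaded in Rust. The actual `model-manifest.json` schema is not documented in this post, so every field name below is an assumption; the sketch requires the `serde` crate (with the `derive` feature) and `serde_json`.

```rust
use serde::Deserialize;

/// Hypothetical shape of a `model-manifest.json` descriptor. The real
/// schema is not published here, so all fields are guesses.
#[derive(Debug, Deserialize)]
struct ModelManifest {
    family: String,       // e.g. "gemma4" or "qwen3"
    model: String,        // e.g. "Qwen3-4B"
    quantization: String, // e.g. "4-bit+group64+affine"
    weights: Vec<String>, // MLX safetensors shard paths
    context_length: usize,
}

fn load_manifest(path: &str) -> Result<ModelManifest, Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&raw)?)
}
```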

## Performance Evidence & Results

**Decoding Throughput (tokens/s)** (test environment: Apple M5 Max, 128 GB, macOS 26.4.1, batch=1, prefill_step_size=2048):
| Model | MLX Quantization | Prompt Tokens | mlx_lm | AX Engine (speculative) |
|-------|------------------|---------------|--------|-------------------------|
| Gemma4 E2B | 4-bit+group64+affine | 128 | 197.5 | **467.6** (+136.8%) |
| Gemma4 E2B | 4-bit+group64+affine | 512 | 191.9 | **464.8** (+142.2%) |
| Qwen3-4B | 4-bit+group64 | 128 | 169.6 | **311.5** (+83.7%) |
| Qwen3-4B | 4-bit+group64 | 512 | 169.8 | **289.5** (+70.4%) |
| Qwen3.5-9B | 4-bit+group64+affine | 128 | 92.6 | **168.7** (+82.1%) |
| Qwen3.5-9B | 4-bit+group64+affine | 512 | 94.8 | 87.5 (-7.7%) |

Note: Qwen3.5 uses a rollback-safe branch/recompute path for its SSM state; under linear attention, speculation relies on repeated n-gram evidence and cools down after partial acceptance. The 512-token random-prompt case falls back to greedy decoding because speculation overhead exceeds the draft gains.
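
A simplified cost model (our assumption, not the engine's actual heuristic) makes the fallback intuitive: one verification forward pass commits roughly 1 + αk tokens for acceptance rate α and draft length k, while drafting and rollback add a relative per-step overhead c, so speculation only wins when (1 + αk) / (1 + c) > 1.

```rust
/// Simplified speculation break-even model (an illustrative assumption,
/// not the engine's actual heuristic). One verification forward pass
/// commits roughly 1 + alpha * k tokens (the verified token plus accepted
/// drafts); drafting and rollback add a relative overhead `c` per step.
fn speculation_speedup(alpha: f64, k: f64, c: f64) -> f64 {
    (1.0 + alpha * k) / (1.0 + c)
}

fn main() {
    // High acceptance: ~3.2x tokens per step even with 20% overhead.
    println!("{:.2}", speculation_speedup(0.7, 4.0, 0.2)); // ~3.17
    // Low acceptance plus high overhead: < 1.0, so greedy decoding wins.
    println!("{:.2}", speculation_speedup(0.1, 4.0, 0.5)); // ~0.93
}
```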

**Prefill Throughput (tokens/s)**:
| Model | MLX Quantization | Prompt Tokens | mlx_lm | AX Engine |
|-------|------------------|---------------|--------|-----------|
| Gemma4 E2B | 4-bit+group64+affine | 128 | 2265.8 | **3248.7** (+43.4%) |
| Qwen3-4B | 4-bit+group64 | 128 | 1581.1 | **3077.7** (+94.7%) |

**Workload Contract Validation**: All tested models (Gemma4 E2B, Qwen3-4B, Qwen3.5-9B) passed with valid TTFT (time to first token) and token counts.
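
As a rough sketch of what such a contract check might assert, assuming each completed request reports its TTFT and token counts (struct and field names are invented, not the engine's actual validation tool):

```rust
/// Hypothetical workload-contract check: every completed request must
/// report a positive, finite time-to-first-token and the expected number
/// of generated tokens. Purely illustrative.
struct WorkloadReport {
    ttft_ms: f64,
    generated_tokens: usize,
    requested_tokens: usize,
}

fn validate(report: &WorkloadReport) -> Result<(), String> {
    if !(report.ttft_ms.is_finite() && report.ttft_ms > 0.0) {
        return Err(format!("invalid TTFT: {} ms", report.ttft_ms));
    }
    if report.generated_tokens != report.requested_tokens {
        return Err(format!(
            "token count mismatch: got {}, expected {}",
            report.generated_tokens, report.requested_tokens
        ));
    }
    Ok(())
}
```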

## Conclusion & Key Innovations

AX Engine's core argument is that decoding strategies built on MLX (speculative token prediction, request scheduling, KV state management) deliver significantly higher effective throughput for supported workloads.

Key Innovations:
1. N-gram self-speculative decoding without a second draft model.
2. Deterministic request lifecycle and KV block management.
3. Workload contract validation tool (ensures correctness, determinism, routing identity, regression testing).
4. Dedicated optimizations for Apple Silicon M4+.

## Project Information & Requirements

- **License**: MIT License.
- **Developer**: DEFAI Private Limited.
- **Community**: Discord link for technical support.
- **Requirements**: macOS on Apple Silicon M4+; Rust 1.85+.
- **Structure**: Cargo workspace with multiple crates (engine core, MLX integration, SDK, server, benchmarks, Python extension) for full toolchain support.
