Decoding Throughput (token/s) (Test env: Apple M5 Max with 128 GB, macOS 26.4.1, batch=1, prefill_step_size=2048):
| Model | MLX Quantization | Prompt Tokens | mlx_lm | ax engine (speculative) |
|---|---|---|---|---|
| Gemma4 E2B | 4-bit+group64+affine | 128 | 197.5 | 467.6 (+136.8%) |
| Gemma4 E2B | 4-bit+group64+affine | 512 | 191.9 | 464.8 (+142.2%) |
| Qwen3-4B | 4-bit+group64 | 128 | 169.6 | 311.5 (+83.7%) |
| Qwen3-4B | 4-bit+group64 | 512 | 169.8 | 289.5 (+70.4%) |
| Qwen3.5-9B | 4-bit+group64+affine | 128 | 92.6 | 168.7 (+82.1%) |
| Qwen3.5-9B | 4-bit+group64+affine | 512 | 94.8 | 87.5 (-7.7%) |
Note: Qwen3.5 uses a rollback-safe branch/recompute path for its SSM state; linear-attention speculation repeats n-gram evidence and cools down after a partial acceptance. The 512-token random-prompt case falls back to greedy decoding because the speculation overhead exceeds the draft gains.
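As a rough illustration of the mechanism behind this note, the sketch below outlines a rollback-safe speculative step for a model with recurrent (SSM / linear-attention) state: snapshot the state before verifying draft tokens, restore and recompute on partial acceptance, and cool the n-gram drafter down afterwards. All callable names (`snapshot_state`, `restore_state`, `replay`, `verify`, `greedy_step`, `ngram_draft`) are hypothetical placeholders and not the ax engine's actual API.

```python
# Sketch only: control flow of rollback-safe speculation for a recurrent-state model.
# The model-facing callables are placeholders supplied by the caller.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RollbackSafeSpeculator:
    snapshot_state: Callable[[], object]                 # deep-copy SSM / linear-attention state
    restore_state: Callable[[object], None]              # roll state back to a snapshot
    replay: Callable[[List[int]], None]                  # re-feed accepted tokens after a rollback
    verify: Callable[[List[int]], List[int]]             # target-model check; returns accepted prefix
    greedy_step: Callable[[List[int]], int]              # plain single-token decode
    ngram_draft: Callable[[List[int], int], List[int]]   # propose draft tokens from n-gram evidence
    cooldown_steps: int = 4                               # steps to skip drafting after partial acceptance
    _cooldown: int = field(default=0, init=False)

    def step(self, context: List[int], draft_len: int = 4) -> List[int]:
        """Produce the next accepted token(s), speculating only when not cooling down."""
        if self._cooldown > 0:
            # While cooling down after a partial acceptance, decode greedily.
            self._cooldown -= 1
            return [self.greedy_step(context)]

        draft = self.ngram_draft(context, draft_len)
        if not draft:
            return [self.greedy_step(context)]

        snapshot = self.snapshot_state()                  # branch point before speculation
        accepted = self.verify(draft)
        if len(accepted) < len(draft):
            # Partial acceptance: the recurrent state has absorbed rejected tokens,
            # so restore the snapshot, recompute over the accepted prefix, and cool down.
            self.restore_state(snapshot)
            self.replay(accepted)
            self._cooldown = self.cooldown_steps
        if not accepted:
            accepted = [self.greedy_step(context)]
        return accepted
```

The rollback path is why fully recurrent layers can still be speculated over safely: rejected draft tokens never leave a trace in the state, at the cost of recomputing the accepted prefix, which is where the 512-token case loses its advantage.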
Prefill Throughput (token/s):
| Model | MLX Quantization | Prompt Tokens | mlx_lm | ax engine |
|---|---|---|---|---|
| Gemma4 E2B | 4-bit+group64+affine | 128 | 2265.8 | 3248.7 (+43.4%) |
| Qwen3-4B | 4-bit+group64 | 128 | 1581.1 | 3077.7 (+94.7%) |
Workload Contract Validation: All tested models (Gemma4 E2B, Qwen3-4B, Qwen3.5-9B) passed with valid TTFT and token counts.
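The sketch below shows the kind of check such a contract could imply, assuming it means a strictly positive TTFT and a generated-token count that matches the request; the `RunResult` fields and the example values are assumptions for illustration, not the actual harness schema or measured data.

```python
# Hypothetical workload-contract check: positive TTFT and consistent token counts.
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str
    prompt_tokens: int
    generated_tokens: int
    requested_tokens: int
    ttft_ms: float  # time to first token, in milliseconds


def validate(result: RunResult) -> None:
    # TTFT must be measured and strictly positive.
    assert result.ttft_ms > 0, f"{result.model}: invalid TTFT {result.ttft_ms}"
    # Prompt and generated token counts must be positive and consistent with the request.
    assert result.prompt_tokens > 0, f"{result.model}: empty prompt"
    assert result.generated_tokens == result.requested_tokens, (
        f"{result.model}: generated {result.generated_tokens}, "
        f"expected {result.requested_tokens}"
    )


# Example usage with placeholder numbers (not benchmark results).
validate(RunResult("Gemma4 E2B", prompt_tokens=128, generated_tokens=256,
                   requested_tokens=256, ttft_ms=50.0))
```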