mlxforge: A LLaMA Inference Engine Built From Scratch on Apple MLX

mlxforge is a LLaMA inference engine built from scratch in C++ on the Apple MLX framework, offering OpenAI-compatible HTTP APIs and continuous batching capabilities. It loads raw safetensors weights, runs numerically correct transformer forward passes on Metal GPUs, and serves concurrent users via a vLLM-style single worker thread/three-queue scheduler.

Tags: MLX, Apple Silicon, LLaMA, inference engine, C++, continuous batching, OpenAI-compatible, Metal, KV cache, numerical correctness
Published 2026-06-02 06:07 | Recent activity 2026-06-02 06:23 | Estimated read 6 min
Introduction: Why Build an Inference Engine From Scratch?

Most developers doing LLM inference reach for off-the-shelf frameworks such as vLLM, TensorRT-LLM, or llama.cpp. These tools are heavily optimized and feature-rich, but in day-to-day use they behave like black boxes. When you need to understand every numerically sensitive stage of a transformer, or to implement optimizations specific to Apple Silicon, they may not meet your needs.

mlxforge takes a different path: building a complete LLaMA inference engine from scratch in C++ on Apple's MLX framework. The goal is not to reinvent the wheel, but to understand in depth how the wheel turns.


What is mlxforge?

mlxforge offers four main capabilities:

  • OpenAI-compatible HTTP APIs: Endpoints such as /v1/chat/completions, /v1/completions, and /v1/models
  • Continuous Batching: A vLLM-style scheduler with a single worker thread and three queues
  • Numerical Correctness: Every numerically sensitive stage is validated against golden reference outputs from mlx-lm
  • KV Cache: Single-sequence and batch caches, with filter (eviction) and merge (admission) operations on the batch cache

Target Model: mlx-community/Llama-3.2-1B-Instruct (default fp16, optional 4-bit quantization)


Numerical Correctness

One of mlxforge's core design principles is numerical correctness: the forward-pass logits and greedy tokens exactly match those of mlx-lm. Golden .npy fixtures gate each step, because the failure mode of a numerical bug is silently wrong output rather than a crash.

This strict validation ensures:

  • Model behavior is consistent with the reference implementation
  • Numerical results can be trusted during debugging
  • Predictability in production environments
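
As a rough illustration of what such a gate checks, here is a minimal sketch in standard C++. The helper names (max_abs_diff, greedy_token, matches_golden) are hypothetical and not taken from mlxforge, and loading the .npy fixture is left out; the vectors stand in for one row of logits from the engine and from the golden file. Passing a tolerance of 0 expresses the "exactly match" requirement.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: largest absolute element-wise difference between
// engine logits and golden logits. NaN is reported as infinity so that a
// poisoned tensor can never pass the gate silently.
inline float max_abs_diff(const std::vector<float>& got,
                          const std::vector<float>& golden) {
    assert(got.size() == golden.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float d = std::fabs(got[i] - golden[i]);
        if (std::isnan(d)) return std::numeric_limits<float>::infinity();
        worst = std::max(worst, d);
    }
    return worst;
}

// Index of the largest logit, i.e. the token greedy decoding would pick.
inline std::size_t greedy_token(const std::vector<float>& logits) {
    assert(!logits.empty());
    return static_cast<std::size_t>(
        std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// Gate: logits agree within `tol` and both sides pick the same greedy token.
inline bool matches_golden(const std::vector<float>& got,
                           const std::vector<float>& golden,
                           float tol) {
    if (got.empty() || got.size() != golden.size()) return false;
    return max_abs_diff(got, golden) <= tol &&
           greedy_token(got) == greedy_token(golden);
}
```

Checking the greedy token in addition to the logits is deliberate in this sketch: a small logit drift can still flip the argmax, and the token stream is what users actually see.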

KV Cache Architecture

mlxforge implements two KV cache modes:

Single Sequence Cache (SingleKVCache):

  • Cache optimized for a single sequence
  • Supports left-padded layout
  • 256-token block growth strategy
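
The 256-token growth strategy can be pictured as rounding the required capacity up to the next multiple of 256, so the cache reallocates once per block rather than once per token. The sketch below only tracks sizes; kBlockTokens, rounded_capacity, and CacheExtent are hypothetical names, and the actual key/value tensor reallocation in mlxforge is not shown.

```cpp
#include <cassert>
#include <cstddef>

// Assumed block size, taken from the "256-token block growth" description.
constexpr std::size_t kBlockTokens = 256;

// Smallest multiple of kBlockTokens that can hold `needed_tokens`.
constexpr std::size_t rounded_capacity(std::size_t needed_tokens) {
    return ((needed_tokens + kBlockTokens - 1) / kBlockTokens) * kBlockTokens;
}

// Hypothetical size bookkeeping for one sequence's cache.
struct CacheExtent {
    std::size_t length = 0;    // tokens written so far
    std::size_t capacity = 0;  // tokens allocated

    // Records `new_tokens` appended tokens.
    // Returns true when the append forced the cache to grow.
    bool append(std::size_t new_tokens) {
        length += new_tokens;
        if (length <= capacity) return false;
        capacity = rounded_capacity(length);
        return true;
    }
};
```

Under these assumptions, decoding one token at a time triggers a growth only on every 256th token; all other steps write into already-allocated space.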

Batch Cache (BatchKVCache):

  • Supports multi-sequence batching
  • update_and_fetch: Update and retrieve cache state
  • filter: Evict finished sequences from the batch
  • merge: Admit new sequences
  • pad_dummies: Handle variable-length sequences
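
To make filter and merge concrete, here is a toy model that tracks only per-row bookkeeping for a left-padded batch. BatchRows and its members are hypothetical; a real batch cache would also slice, concatenate, and re-pad the key/value tensors, and pad_dummies is not modeled because the post does not describe its behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a batched KV cache: one entry per batch row.
struct BatchRows {
    std::vector<int> request_ids;      // which request owns each row
    std::vector<std::size_t> lengths;  // real (non-padding) tokens per row

    // Padded width shared by all rows: the longest sequence in the batch.
    std::size_t width() const {
        return lengths.empty()
                   ? 0
                   : *std::max_element(lengths.begin(), lengths.end());
    }

    // Number of padding slots in front of a row's real tokens.
    std::size_t left_pad(std::size_t row) const {
        return width() - lengths[row];
    }

    // Eviction: keep only the rows listed in `keep`, in that order.
    void filter(const std::vector<std::size_t>& keep) {
        std::vector<int> ids;
        std::vector<std::size_t> lens;
        for (std::size_t row : keep) {
            assert(row < request_ids.size());
            ids.push_back(request_ids[row]);
            lens.push_back(lengths[row]);
        }
        request_ids = std::move(ids);
        lengths = std::move(lens);
    }

    // Admission: append the rows of another batch (e.g. fresh prefills).
    void merge(const BatchRows& other) {
        request_ids.insert(request_ids.end(), other.request_ids.begin(),
                           other.request_ids.end());
        lengths.insert(lengths.end(), other.lengths.begin(),
                       other.lengths.end());
    }
};
```

Note how both operations can change the padded width: evicting the longest row shrinks it, and admitting a longer prompt increases the left padding of every existing row.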

Continuous Batching

mlxforge adopts a vLLM-style continuous batching architecture:

  • Single GPU Worker Thread: Owns all MLX state and is the only thread that calls eval/async_eval
  • One async_eval per Decoding Step: The entire batch shares a single evaluation
  • Batch Size Bucketing: Rounds batch sizes to a small set of buckets so graph shapes repeat and compiled graphs can be reused

This design aims to keep the GPU busy while keeping the code simple and maintainable, since only one thread ever touches MLX state.
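
The worker loop described above can be sketched in plain C++ without any MLX calls. Everything here is an assumption made for illustration: the post does not name the three queues, so waiting, running, and finished are hypothetical, as are Request, Scheduler, and the power-of-two bucketing in bucket_for. In a real server the waiting queue would also need a mutex, because HTTP threads push into it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Request {
    int id;
    std::size_t remaining_tokens;  // decode steps left before it finishes
};

// Assumed bucketing scheme: round up to a power of two, capped at max_batch.
inline std::size_t bucket_for(std::size_t batch_size, std::size_t max_batch) {
    std::size_t bucket = 1;
    while (bucket < batch_size && bucket < max_batch) bucket *= 2;
    return std::min(bucket, max_batch);
}

struct Scheduler {
    std::size_t max_batch;
    std::deque<Request> waiting;    // submitted, not yet in the batch
    std::vector<Request> running;   // rows of the current decode batch
    std::vector<Request> finished;  // done, ready to be reported back

    // One worker iteration: admit, decode one token per row, evict.
    // Returns the bucketed batch size the step ran at (0 if idle).
    std::size_t step() {
        while (running.size() < max_batch && !waiting.empty()) {
            running.push_back(waiting.front());
            waiting.pop_front();
        }
        if (running.empty()) return 0;
        const std::size_t shape = bucket_for(running.size(), max_batch);

        // A real engine would build the batched graph here and issue a
        // single async_eval for all rows; this sketch only counts tokens.
        for (Request& r : running) {
            if (r.remaining_tokens > 0) --r.remaining_tokens;
        }

        // Move completed rows from running to finished, keeping order.
        auto done = std::stable_partition(
            running.begin(), running.end(),
            [](const Request& r) { return r.remaining_tokens > 0; });
        finished.insert(finished.end(), done, running.end());
        running.erase(done, running.end());
        return shape;
    }
};
```

Because admission and eviction happen between decode steps, a short request can leave and a new one can join without waiting for the longest sequence in the batch to finish, which is the point of continuous batching.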

Sampling as Graph Operations

mlxforge implements sampling as MLX graph operations, supporting:

  • Greedy sampling
  • Temperature sampling
  • Top-k sampling
  • Top-p sampling

Key Optimization: Logits are never read back to the host; all sampling operations run on the GPU.
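
For readers who want to see what these four modes compute, below is a host-side reference sketch in standard C++. It is not how mlxforge does it, since the engine expresses the same math as MLX graph operations on the GPU. SamplingParams, filtered_probs, and sample_token are hypothetical names, and treating a temperature of 0 or less as greedy is a convention assumed here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct SamplingParams {
    float temperature = 1.0f;  // <= 0 selects greedy decoding (assumed)
    std::size_t top_k = 0;     // 0 disables top-k
    float top_p = 1.0f;        // >= 1 disables top-p
};

// Returns the filtered, renormalized distribution over token ids.
inline std::vector<float> filtered_probs(const std::vector<float>& logits,
                                         const SamplingParams& p) {
    assert(!logits.empty());
    const std::size_t n = logits.size();

    // Token ids ordered from most to least likely.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return logits[a] > logits[b];
                     });

    std::vector<float> out(n, 0.0f);
    if (p.temperature <= 0.0f) {  // greedy: all mass on the argmax
        out[order[0]] = 1.0f;
        return out;
    }

    // Softmax of temperature-scaled logits, shifted by the max for stability.
    std::vector<float> probs(n);
    const float max_logit = logits[order[0]];
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / p.temperature);
        total += probs[i];
    }
    for (float& x : probs) x /= total;

    // Keep at most top_k tokens, stopping once their mass reaches top_p.
    const std::size_t limit = (p.top_k == 0) ? n : std::min(p.top_k, n);
    float kept_mass = 0.0f;
    for (std::size_t rank = 0; rank < limit; ++rank) {
        const std::size_t tok = order[rank];
        out[tok] = probs[tok];
        kept_mass += probs[tok];
        if (p.top_p < 1.0f && kept_mass >= p.top_p) break;
    }
    for (float& x : out) x /= kept_mass;
    return out;
}

// Inverse-CDF draw: `u` is a uniform random number in [0, 1).
inline std::size_t sample_token(const std::vector<float>& probs, float u) {
    float cumulative = 0.0f;
    std::size_t last_nonzero = 0;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] <= 0.0f) continue;
        cumulative += probs[i];
        last_nonzero = i;
        if (u < cumulative) return i;
    }
    return last_nonzero;  // guards against rounding when u is close to 1
}
```

Every step here (sort, softmax, cumulative sum, masking) has a direct array-operation equivalent, which is what makes it possible to keep the whole pipeline on the GPU and transfer only the chosen token id.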