# mlxforge: A LLaMA Inference Engine Built From Scratch on Apple MLX

> mlxforge is a LLaMA inference engine built from scratch in C++ on the Apple MLX framework, offering OpenAI-compatible HTTP APIs and continuous batching capabilities. It loads raw safetensors weights, runs numerically correct transformer forward passes on Metal GPUs, and serves concurrent users via a vLLM-style single worker thread/three-queue scheduler.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-01T22:07:25.000Z
- Last activity: 2026-06-01T22:23:49.180Z
- Popularity: 167.7
- Keywords: MLX, Apple Silicon, LLaMA, inference engine, C++, continuous batching, OpenAI-compatible, Metal, KV cache, numerical correctness, quantization, local LLM
- Page link: https://www.zingnex.cn/en/forum/thread/mlxforge-apple-mlxllama
- Canonical: https://www.zingnex.cn/forum/thread/mlxforge-apple-mlxllama
- Markdown source: floors_fallback

---

## Original Author and Source

- **Original Author/Maintainer**: hvasconcelos
- **Source Platform**: GitHub
- **Original Title**: mlxforge
- **Original Link**: https://github.com/hvasconcelos/mlxforge
- **Publication Date**: June 1, 2026

---

## Introduction: Why Build an Inference Engine From Scratch?

In AI inference, most developers reach for off-the-shelf frameworks such as vLLM, TensorRT-LLM, or llama.cpp. These tools are heavily optimized and feature-rich, but their size and layers of abstraction make them behave like black boxes in practice. When you need to understand every numerically sensitive stage of a transformer, or to implement optimizations specific to Apple Silicon, they may not meet your needs.

mlxforge takes a different path: building a complete LLaMA inference engine from scratch in C++ on Apple's MLX framework. This is not to reinvent the wheel, but to deeply understand how the wheel turns.

---

## What is mlxforge?

mlxforge is a LLaMA inference engine built from scratch in C++ on the Apple MLX framework, offering:

- **OpenAI-compatible HTTP APIs**: Endpoints like `/v1/chat/completions`, `/v1/completions`, `/v1/models`
- **Continuous Batching**: vLLM-style single worker thread/three-queue scheduler
- **Numerical Correctness**: Every numerically sensitive stage is validated against golden reference outputs from `mlx-lm`
- **KV Cache**: Single-sequence and batch caches with support for filter/eviction and merge/admission

Target Model: `mlx-community/Llama-3.2-1B-Instruct` (default fp16, optional 4-bit quantization)

---

## Numerical Correctness

One of mlxforge's core design principles is numerical correctness. The forward-pass logits and greedy tokens exactly match those of `mlx-lm`. Golden `.npy` fixtures gate each stage, because a numerical bug here does not crash: the failure mode is **silent garbage output**.

This strict validation ensures:
- Model behavior is consistent with the reference implementation
- Numerical results can be trusted during debugging
- Predictability in production environments
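
The gating idea can be sketched in a few lines of plain C++. This is only an illustration under stated assumptions: the names `GoldenCheck`, `compare_to_golden`, and `passes_gate` are hypothetical, loading the `.npy` fixture is left out (both tensors are assumed to be already flattened into `std::vector<float>`), and the tolerance is a parameter rather than a value taken from the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Result of comparing one stage's output against a golden reference.
struct GoldenCheck {
    float max_abs_diff;   // largest element-wise deviation
    bool argmax_matches;  // would greedy decoding pick the same token?
};

// Index of the largest element (the first one wins on ties).
inline std::size_t argmax(const std::vector<float>& v) {
    assert(!v.empty());
    return static_cast<std::size_t>(
        std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}

// Compare an engine output with the golden fixture of the same shape.
inline GoldenCheck compare_to_golden(const std::vector<float>& actual,
                                     const std::vector<float>& golden) {
    assert(actual.size() == golden.size());
    float max_diff = 0.0f;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        max_diff = std::max(max_diff, std::fabs(actual[i] - golden[i]));
    }
    return {max_diff, argmax(actual) == argmax(golden)};
}

// A stage passes only if the greedy token agrees and the deviation is
// within the (assumed) tolerance; 0.0f demands an exact match.
inline bool passes_gate(const GoldenCheck& c, float tolerance) {
    return c.argmax_matches && c.max_abs_diff <= tolerance;
}
```

A check like this would run once per stage (embedding, attention, MLP, final logits), so that a mismatch points at the first stage that diverges instead of surfacing only as bad text at the end.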

## KV Cache Architecture

mlxforge implements two KV cache modes:

**Single Sequence Cache (SingleKVCache)**:
- Cache optimized for a single sequence
- Supports left-padded layout
- 256-token block growth strategy
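
The block growth strategy can be illustrated with a small capacity tracker. This is a sketch under assumptions, not mlxforge's code: `CacheCapacity` and `round_up_to_block` are hypothetical names, and the struct tracks only lengths where a real cache would also own the key/value buffers.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBlockTokens = 256;  // block size stated in the text

// Smallest multiple of kBlockTokens that can hold `needed_tokens`.
constexpr std::size_t round_up_to_block(std::size_t needed_tokens) {
    return ((needed_tokens + kBlockTokens - 1) / kBlockTokens) * kBlockTokens;
}

// Tracks lengths only; buffer allocation and copying are omitted.
struct CacheCapacity {
    std::size_t used = 0;      // tokens currently stored
    std::size_t capacity = 0;  // tokens the buffer can hold

    // Reserve room for `new_tokens`; returns true if the buffer had to grow.
    bool append(std::size_t new_tokens) {
        assert(new_tokens > 0);
        const std::size_t needed = used + new_tokens;
        bool grew = false;
        if (needed > capacity) {
            capacity = round_up_to_block(needed);
            grew = true;
        }
        used = needed;
        return grew;
    }
};
```

Growing in fixed blocks means a reallocation happens at most once every 256 decoded tokens rather than on every step.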

**Batch Cache (BatchKVCache)**:
- Supports multi-sequence batching
- `update_and_fetch`: Update and retrieve cache state
- `filter`: Evict finished sequences from the batch
- `merge`: Admit new sequences
- `pad_dummies`: Handle variable-length sequences
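
To make eviction and admission concrete, here is a toy model of a batched cache in plain C++. It is an assumption-laden sketch: `ToyBatchCache` is hypothetical, integers stand in for key/value entries, and the real `BatchKVCache` signatures and padding rules are not shown in the source. It only illustrates the idea of keeping selected rows and left-padding rows to a common length when new sequences join.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr int kPad = -1;  // placeholder value for padded positions

// Toy stand-in for a batched KV cache: one row per sequence, all rows share
// a common length, and shorter sequences are padded on the left.
struct ToyBatchCache {
    std::vector<std::vector<int>> rows;  // stand-in for K/V entries
    std::vector<std::size_t> left_pad;   // padding count per row

    std::size_t length() const { return rows.empty() ? 0 : rows.front().size(); }

    // Keep only the rows listed in `keep`; everything else is evicted.
    void filter(const std::vector<std::size_t>& keep) {
        std::vector<std::vector<int>> kept_rows;
        std::vector<std::size_t> kept_pad;
        for (std::size_t idx : keep) {
            assert(idx < rows.size());
            kept_rows.push_back(rows[idx]);
            kept_pad.push_back(left_pad[idx]);
        }
        rows = std::move(kept_rows);
        left_pad = std::move(kept_pad);
    }

    // Admit the rows of `other`, then left-pad every row to the longest length.
    void merge(const ToyBatchCache& other) {
        const std::size_t target = std::max(length(), other.length());
        for (std::size_t i = 0; i < other.rows.size(); ++i) {
            rows.push_back(other.rows[i]);
            left_pad.push_back(other.left_pad[i]);
        }
        for (std::size_t i = 0; i < rows.size(); ++i) {
            const std::size_t missing = target - rows[i].size();
            rows[i].insert(rows[i].begin(), missing, kPad);
            left_pad[i] += missing;
        }
    }
};
```

With left padding, the most recent token of every sequence sits in the last column, so a single decode step can append one new column for the whole batch.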

## Continuous Batching

mlxforge adopts a vLLM-style continuous batching architecture:

- **Single GPU Worker Thread**: Owns all MLX state and is the only thread that calls `eval`/`async_eval`
- **One `async_eval` per Decoding Step**: The entire batch shares one evaluation
- **Batch Size Bucketing**: Rounds batch sizes to a small set of buckets so that graph shapes repeat and compiled graphs can be reused

This design keeps the GPU busy across concurrent requests while confining all MLX calls to one thread, which keeps the code simple and maintainable.
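
A single scheduler tick and the bucketing rule can be sketched as follows. Several things here are assumptions and not details from the source: the three queues are given the roles waiting, running, and finished; buckets are powers of two capped at `max_batch`; and `ToyScheduler`, `Request`, and `bucket_batch_size` are hypothetical names. Threading and the actual forward pass are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Round a live batch size up to the next power of two, capped at max_batch,
// so that only a handful of distinct batch shapes ever reach the GPU.
inline std::size_t bucket_batch_size(std::size_t live, std::size_t max_batch) {
    assert(live > 0 && live <= max_batch);
    std::size_t bucket = 1;
    while (bucket < live) bucket *= 2;
    return std::min(bucket, max_batch);
}

struct Request {
    int id;
    int tokens_left;  // decode steps remaining until the request finishes
};

// Assumed queue roles: waiting -> running -> finished.
struct ToyScheduler {
    std::deque<Request> waiting;
    std::vector<Request> running;
    std::deque<Request> finished;
    std::size_t max_batch = 8;

    // One tick: admit waiting requests, run one decode step for the whole
    // batch, retire finished requests. Returns the padded batch size used
    // for the step, or 0 if there was nothing to run.
    std::size_t step() {
        while (!waiting.empty() && running.size() < max_batch) {
            running.push_back(waiting.front());
            waiting.pop_front();
        }
        if (running.empty()) return 0;
        const std::size_t bucket = bucket_batch_size(running.size(), max_batch);
        // A real engine would launch one batched forward pass and sample here.
        std::vector<Request> still_running;
        for (Request& r : running) {
            if (--r.tokens_left <= 0) {
                finished.push_back(r);
            } else {
                still_running.push_back(r);
            }
        }
        running = std::move(still_running);
        return bucket;
    }
};
```

In a threaded server, only the worker thread would call `step()`, and HTTP handler threads would only push onto the waiting queue under a lock; that division is what lets a single thread own all GPU state.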

## Sampling as Graph Operations

mlxforge implements sampling as MLX graph operations, supporting:
- Greedy sampling
- Temperature sampling
- Top-k sampling
- Top-p sampling

Key Optimization: **No need to read logits back to the host**. All sampling operations run on the GPU, so only the sampled token IDs have to cross to the CPU.
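
For readers unfamiliar with these strategies, the math can be written as a CPU reference in plain C++. This is explicitly not how mlxforge does it (the text says sampling runs as MLX graph operations on the GPU); it is a hypothetical reference with made-up function names that shows what temperature, top-k, and top-p do to a distribution. The final random draw from the filtered distribution is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Greedy sampling: pick the index of the largest logit.
inline std::size_t greedy(const std::vector<float>& logits) {
    assert(!logits.empty());
    return static_cast<std::size_t>(
        std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// Softmax of logits / temperature, computed stably by subtracting the max.
inline std::vector<float> softmax(const std::vector<float>& logits,
                                  float temperature) {
    assert(!logits.empty() && temperature > 0.0f);
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    return probs;
}

// Keep at most top_k of the most likely tokens, stopping early once their
// cumulative mass reaches top_p; zero the rest and renormalize.
inline std::vector<float> filter_probs(const std::vector<float>& probs,
                                       std::size_t top_k, float top_p) {
    assert(!probs.empty() && top_k > 0 && top_p > 0.0f && top_p <= 1.0f);
    std::vector<std::size_t> order(probs.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });
    std::vector<float> kept(probs.size(), 0.0f);
    float mass = 0.0f;
    for (std::size_t rank = 0; rank < order.size() && rank < top_k; ++rank) {
        kept[order[rank]] = probs[order[rank]];
        mass += probs[order[rank]];
        if (mass >= top_p) break;
    }
    for (float& p : kept) p /= mass;
    return kept;
}
```

Passing `top_k = probs.size()` disables the top-k cut and `top_p = 1.0f` disables the nucleus cut, so the same function covers plain temperature sampling as well.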
