mlxforge: A LLaMA Inference Engine Built From Scratch on Apple MLX

mlxforge is a LLaMA inference engine built from scratch in C++ on the Apple MLX framework, offering OpenAI-compatible HTTP APIs and continuous batching. It loads raw safetensors weights, runs numerically correct transformer forward passes on Metal GPUs, and serves concurrent users through a vLLM-style scheduler with a single worker thread and three queues.

Tags: MLX, Apple Silicon, LLaMA, inference engine, C++, continuous batching, OpenAI-compatible, Metal, KV cache, numerical correctness

Published 2026-06-02 06:07 | Recent activity 2026-06-02 06:23 | Estimated read 6 min

Section 01

Introduction / Main Post: mlxforge: A LLaMA Inference Engine Built From Scratch on Apple MLX

Section 02

Original Author and Source

Original Author/Maintainer: hvasconcelos
Source Platform: GitHub
Original Title: mlxforge
Original Link: https://github.com/hvasconcelos/mlxforge
Publication Date: June 1, 2026

Section 03

Introduction: Why Build an Inference Engine From Scratch?

Most developers doing AI inference reach for off-the-shelf frameworks such as vLLM, TensorRT-LLM, or llama.cpp. These tools are heavily optimized and feature-rich, but in practice they are usually treated as black boxes. When you need to understand every numerically sensitive stage of a transformer, or implement specific optimizations on Apple Silicon, off-the-shelf solutions may not meet your needs.

mlxforge takes a different path: building a complete LLaMA inference engine from scratch in C++ on Apple's MLX framework. The goal is not to reinvent the wheel, but to understand in depth how the wheel turns.

Section 04

What is mlxforge?

mlxforge is a LLaMA inference engine built from scratch in C++ on the Apple MLX framework, offering:

OpenAI-compatible HTTP APIs: Endpoints like /v1/chat/completions, /v1/completions, /v1/models
Continuous Batching: vLLM-style scheduler with a single worker thread and three queues
Numerical Correctness: Every numerically sensitive stage is validated against golden reference outputs from mlx-lm
KV Cache: Single-sequence and batch caches, with filter (eviction) and merge (admission) operations

Target Model: mlx-community/Llama-3.2-1B-Instruct (default fp16, optional 4-bit quantization)

Section 05

Numerical Correctness

One of mlxforge's core design principles is numerical correctness. The forward pass logits and greedy tokens exactly match those of mlx-lm. Golden .npy fixtures gate each step, because a numerical bug here produces silently wrong output rather than a crash.

This strict validation ensures:

Model behavior is consistent with the reference implementation
Numerical results can be trusted during debugging
Behavior is predictable in production environments
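
The gating idea above can be sketched in plain C++. This is a minimal illustration, not mlxforge's actual test code: it assumes the engine's logits and the golden values have already been loaded into vectors (reading the .npy file is left out), and the names max_abs_diff, argmax, and passes_golden_gate are hypothetical. Passing a tolerance of 0 corresponds to the exact match the text describes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Largest element-wise absolute difference between two equally sized vectors.
float max_abs_diff(const std::vector<float>& got, const std::vector<float>& want) {
    assert(got.size() == want.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < got.size(); ++i) {
        worst = std::max(worst, std::fabs(got[i] - want[i]));
    }
    return worst;
}

// Index of the largest logit, i.e. the token that greedy decoding would pick.
std::size_t argmax(const std::vector<float>& logits) {
    assert(!logits.empty());
    return static_cast<std::size_t>(
        std::distance(logits.begin(), std::max_element(logits.begin(), logits.end())));
}

// Hypothetical gate: passes only if the logits are within `tol` of the golden
// values and the greedy token agrees. Use tol = 0 for an exact match.
bool passes_golden_gate(const std::vector<float>& got,
                        const std::vector<float>& want, float tol) {
    return !got.empty() && got.size() == want.size() &&
           max_abs_diff(got, want) <= tol && argmax(got) == argmax(want);
}
```

A check like this fails loudly at the first stage that drifts from the reference, which is the point of gating each step rather than only inspecting the final generated text.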

Section 06

KV Cache Architecture

mlxforge implements two KV cache modes:

Single Sequence Cache (SingleKVCache):

Cache optimized for a single sequence
Supports left-padded layout
256-token block growth strategy
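
A block growth strategy of this kind can be sketched as follows. The 256-token block size comes from the text. The CacheExtent type and its bookkeeping are hypothetical and track only lengths, not the actual key/value tensors.

```cpp
#include <cstddef>

constexpr std::size_t kBlock = 256;  // block size stated in the text

// Smallest multiple of kBlock that can hold `tokens` entries.
constexpr std::size_t capacity_for(std::size_t tokens) {
    return ((tokens + kBlock - 1) / kBlock) * kBlock;
}

// Hypothetical bookkeeping for a single-sequence cache: lengths only.
struct CacheExtent {
    std::size_t length = 0;    // tokens currently stored
    std::size_t capacity = 0;  // tokens the allocation can hold

    // Records n appended tokens. Returns true if the cache had to grow,
    // which is the point where a real cache would reallocate its buffers.
    bool append(std::size_t n) {
        length += n;
        if (length <= capacity) return false;
        capacity = capacity_for(length);
        return true;
    }
};
```

Growing in fixed blocks means a reallocation happens at most once per 256 decoded tokens instead of on every step.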

Batch Cache (BatchKVCache):

Supports multi-sequence batching
update_and_fetch: Update and retrieve cache state
filter: Evict sequences from the batch, keeping only the ones still wanted
merge: Admit new sequences
pad_dummies: Handle variable-length sequences
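
The eviction and admission operations can be illustrated with a toy model. This sketch assumes that filter keeps the rows at a given set of batch indices and that merge appends the rows of another cache. ToySequence and ToyBatchCache are hypothetical stand-ins that hold token IDs instead of key/value tensors, and they are not mlxforge's BatchKVCache.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one sequence's cached state.
struct ToySequence {
    int id;
    std::vector<int> tokens;
};

// Hypothetical batch cache: one row per active sequence.
struct ToyBatchCache {
    std::vector<ToySequence> rows;

    // Keep only the rows at the given batch indices, in the given order.
    // Every other row is evicted.
    void filter(const std::vector<std::size_t>& keep) {
        std::vector<ToySequence> kept;
        kept.reserve(keep.size());
        for (std::size_t idx : keep) kept.push_back(rows.at(idx));
        rows = std::move(kept);
    }

    // Append the rows of another cache, admitting its sequences into the batch.
    void merge(ToyBatchCache other) {
        for (auto& row : other.rows) rows.push_back(std::move(row));
    }
};
```

In a continuous batching loop, a filter call would typically run when sequences leave the batch and a merge call when newly prefilled requests join it, so the batch can change between decoding steps.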

Section 07

Continuous Batching

mlxforge adopts a vLLM-style continuous batching architecture:

Single GPU Worker Thread: Owns all MLX state and is the only thread that calls eval/async_eval
One async_eval per Decoding Step: The entire batch shares a single evaluation
Batch Size Bucketing: Restricts batch sizes to a fixed set of buckets so that graph shapes repeat across steps, which helps compilation

This design keeps GPU utilization high while keeping the code simple and maintainable.
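
Batch size bucketing can be sketched as a small rounding function. The power-of-two bucket set and the bucket_for name are assumptions for illustration, since the source does not say which buckets mlxforge uses.

```cpp
#include <cstddef>

// Rounds the number of active sequences up to the next power of two,
// capped at max_bucket. Assumes max_bucket is itself a power of two.
// The power-of-two bucket set is an illustrative assumption.
constexpr std::size_t bucket_for(std::size_t active, std::size_t max_bucket) {
    std::size_t bucket = 1;
    while (bucket < active && bucket < max_bucket) bucket *= 2;
    return bucket;
}
```

With a cap of 8, any batch of 5 to 8 active sequences runs at shape 8, so the engine sees only four distinct batch shapes (1, 2, 4, 8) instead of eight. Slots between the active count and the bucket size would need padding entries.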

Section 08

Sampling as Graph Operations

mlxforge implements sampling as MLX graph operations, supporting:

Greedy sampling
Temperature sampling
Top-k sampling
Top-p sampling

Key Optimization: Logits never need to be read back to the host, because all sampling operations run on the GPU.
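
For readers unfamiliar with these sampling filters, here is a CPU reference sketch of the logic in plain C++. It is not how mlxforge implements them: the project expresses sampling as MLX graph operations on the GPU, while this sketch only shows what temperature, top-k, and top-p compute. The function names are hypothetical, and applying top-p after top-k without renormalizing in between is a simplifying assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Softmax of logits / temperature. Temperature must be greater than 0.
// Lower temperatures sharpen the distribution toward the greedy choice.
std::vector<double> softmax_with_temperature(const std::vector<double>& logits,
                                             double temperature) {
    std::vector<double> probs(logits.size());
    if (logits.empty()) return probs;
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // Subtracting the maximum keeps exp() from overflowing.
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}

// Token indices that survive top-k followed by top-p, ordered by
// descending probability. A sampler would then draw from these tokens.
std::vector<std::size_t> top_k_top_p(const std::vector<double>& probs,
                                     std::size_t k, double p) {
    std::vector<std::size_t> order(probs.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });
    if (k < order.size()) order.resize(k);  // top-k: keep the k most likely tokens
    double cumulative = 0.0;
    std::size_t kept = 0;
    // top-p: keep the shortest prefix whose probability mass reaches p.
    while (kept < order.size() && cumulative < p) {
        cumulative += probs[order[kept]];
        ++kept;
    }
    order.resize(kept);
    return order;
}
```

Greedy sampling is the special case that always takes the first index in this ordering.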
