Reading

mlx-lm-server: High-Performance LLM Inference Server on Apple Silicon

An OpenAI-compatible MLX LLM inference server written in Rust, optimized for Apple Silicon. It embeds Python via PyO3 to enable Metal acceleration, offering low memory usage, fast cold start, and rich API features.

MLXApple SiliconRustLLM推理OpenAI API本地AIPyO3LoRA推测解码模型微调

Published 2026-06-10 15:40Recent activity 2026-06-10 15:49Estimated read 6 min

Section 01

[Introduction] mlx-lm-server: High-Performance LLM Inference Server on Apple Silicon

This article introduces mlx-lm-server, an open-source LLM inference server optimized for Apple Silicon. Written in Rust, it embeds Python via PyO3 to enable Metal acceleration. Key features include low memory usage (only 8MB idle), fast cold start (16ms), full OpenAI API compatibility, support for LoRA hot-swapping, speculative decoding, multi-modal model routing, etc. It can serve as an efficient solution for local AI deployment.

Section 02

Project Background and Overview

mlx-lm-server is developed and maintained by Ar9av, with source code hosted on GitHub (link: https://github.com/Ar9av/mlx-lm-server). It was released on June 10, 2026, and was showcased by Apple at WWDC2025 as an example project for "Building Local AI Agents on Mac with MLX". Compared to traditional Python servers, it has significant advantages in resource usage and startup speed: 8MB idle memory vs. 60-100MB for Python servers, and 16ms cold start vs. 3-5 seconds.

Section 03

Core Architecture Design

The project uses a Rust+PyO3 hybrid architecture: Rust handles HTTP services and concurrent processing (via tokio+axum framework), while model inference calls the MLX Python library through PyO3 to enable Metal acceleration, avoiding the Python GIL bottleneck. It includes two server components: mlx-lm-server (port 8080, focusing on LLM chat completion, LoRA, and vision model support) and mlx-audio-server (port 8001, providing audio functions like TTS/STT). Both are single binary files with a static memory footprint of about 8MB.

Section 04

Key Features

Full OpenAI API compatibility: Supports chat completion, text completion, embeddings, and Anthropic Messages API. Existing OpenAI client code can be used without modification.
LoRA adapter hot-swapping: Dynamically load/unload/switch at runtime; requests specify via adapter_name.
Speculative decoding: Enable draft models to improve throughput.
KV cache quantization: 4/8-bit precision reduces memory usage for long contexts.
Tool calling: Supports OpenAI-style function calls (compatible with models like Llama-3/Qwen).
Model management: Scan local HuggingFace cache, Hub search, and Ollama-compatible endpoints.

Section 05

Performance Benchmarks and Memory Protection

Test results using the Llama-3.2-1B-Instruct-4bit model on Apple M-series chips: 16ms cold start, 2.4s model loading (cached), 8MB idle memory, streaming throughput of 115-261 tok/s, first token time of 86-96ms, and 4 concurrent requests processed in 0.37s without errors. Built-in RAM protection: Checks available memory before loading models to avoid system crashes.

Section 06

Application Scenarios and Community Ecosystem

Application scenarios include: Local AI development (zero configuration, no network latency), privacy-sensitive applications (local data processing), offline environments (no network available), and rapid prototype validation (low latency accelerates iteration). This project is based on the Apple MLX ecosystem, closely integrated with mlx-lm/mlx-vlm/mlx-audio, and was recognized by Apple as a WWDC2025 showcase project. The open-source community is active and evolving continuously.

Section 07

Deployment and Usage Examples

Quick start: ./run.sh lm (LLM server), ./run.sh audio (audio server); Model loading: Via POST /v1/models/load endpoint; Chat completion: POST /v1/chat/completions endpoint (supports stream parameter); Fine-tuning workflow: Training (/v1/train) → Mount adapter → Use → Merge model, all done via API.

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

libmlxforge: An Embedded MLX LLM Inference Engine for Apple Silicon

libmlxforge is an embeddable MLX large language model (LLM) inference engine designed specifically for Apple Silicon. It provides a unified C ABI interface, supports calls from Node.js, Swift, and Rust, and features continuous batching, streaming output, JSON-constrained structured output, and embedding vector generation.

Recent activity 2026-06-09 17:23