# mlx-lm-server: High-Performance LLM Inference Server on Apple Silicon

> An OpenAI-compatible MLX LLM inference server written in Rust, optimized for Apple Silicon. It embeds Python via PyO3 to enable Metal acceleration, offering low memory usage, fast cold start, and rich API features.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-10T07:40:24.000Z
- 最近活动: 2026-06-10T07:49:33.935Z
- 热度: 154.8
- 关键词: MLX, Apple Silicon, Rust, LLM推理, OpenAI API, 本地AI, PyO3, LoRA, 推测解码, 模型微调
- 页面链接: https://www.zingnex.cn/en/forum/thread/mlx-lm-server-apple-siliconllm
- Canonical: https://www.zingnex.cn/forum/thread/mlx-lm-server-apple-siliconllm
- Markdown 来源: floors_fallback

---

## [Introduction] mlx-lm-server: High-Performance LLM Inference Server on Apple Silicon

This article introduces mlx-lm-server, an open-source LLM inference server optimized for Apple Silicon. Written in Rust, it embeds Python via PyO3 to enable Metal acceleration. Key features include low memory usage (only 8MB idle), fast cold start (16ms), full OpenAI API compatibility, support for LoRA hot-swapping, speculative decoding, multi-modal model routing, etc. It can serve as an efficient solution for local AI deployment.

## Project Background and Overview

mlx-lm-server is developed and maintained by Ar9av, with source code hosted on GitHub (link: https://github.com/Ar9av/mlx-lm-server). It was released on June 10, 2026, and was showcased by Apple at WWDC2025 as an example project for "Building Local AI Agents on Mac with MLX". Compared to traditional Python servers, it has significant advantages in resource usage and startup speed: 8MB idle memory vs. 60-100MB for Python servers, and 16ms cold start vs. 3-5 seconds.

## Core Architecture Design

The project uses a Rust+PyO3 hybrid architecture: Rust handles HTTP services and concurrent processing (via tokio+axum framework), while model inference calls the MLX Python library through PyO3 to enable Metal acceleration, avoiding the Python GIL bottleneck. It includes two server components: mlx-lm-server (port 8080, focusing on LLM chat completion, LoRA, and vision model support) and mlx-audio-server (port 8001, providing audio functions like TTS/STT). Both are single binary files with a static memory footprint of about 8MB.

## Key Features

1. Full OpenAI API compatibility: Supports chat completion, text completion, embeddings, and Anthropic Messages API. Existing OpenAI client code can be used without modification.
2. LoRA adapter hot-swapping: Dynamically load/unload/switch at runtime; requests specify via adapter_name.
3. Speculative decoding: Enable draft models to improve throughput.
4. KV cache quantization: 4/8-bit precision reduces memory usage for long contexts.
5. Tool calling: Supports OpenAI-style function calls (compatible with models like Llama-3/Qwen).
6. Model management: Scan local HuggingFace cache, Hub search, and Ollama-compatible endpoints.

## Performance Benchmarks and Memory Protection

Test results using the Llama-3.2-1B-Instruct-4bit model on Apple M-series chips: 16ms cold start, 2.4s model loading (cached), 8MB idle memory, streaming throughput of 115-261 tok/s, first token time of 86-96ms, and 4 concurrent requests processed in 0.37s without errors. Built-in RAM protection: Checks available memory before loading models to avoid system crashes.

## Application Scenarios and Community Ecosystem

Application scenarios include: Local AI development (zero configuration, no network latency), privacy-sensitive applications (local data processing), offline environments (no network available), and rapid prototype validation (low latency accelerates iteration). This project is based on the Apple MLX ecosystem, closely integrated with mlx-lm/mlx-vlm/mlx-audio, and was recognized by Apple as a WWDC2025 showcase project. The open-source community is active and evolving continuously.

## Deployment and Usage Examples

Quick start: `./run.sh lm` (LLM server), `./run.sh audio` (audio server); Model loading: Via POST /v1/models/load endpoint; Chat completion: POST /v1/chat/completions endpoint (supports stream parameter); Fine-tuning workflow: Training (/v1/train) → Mount adapter → Use → Merge model, all done via API.