Zing Forum

vllm-mlx: Natively Run High-Performance Large Language Models and Multimodal Inference on Apple Silicon

vllm-mlx brings vLLM's high-throughput inference capabilities to Apple Silicon, enabling native GPU acceleration via the MLX framework. It supports multimodal processing of text, images, video, and audio, is compatible with OpenAI and Anthropic APIs, and allows Mac users to run large models locally at a generation speed of over 400 tokens per second.

Tags: vllm-mlx, Apple Silicon, MLX, large language models, local inference, multimodal, OpenAI API, Claude Code, MCP, speech synthesis
Published 2026-04-01 00:45 · Recent activity 2026-04-01 01:21 · Estimated read 6 min

Section 01

vllm-mlx: A High-Performance Multimodal Inference Solution for Apple Silicon

The vllm-mlx project integrates vLLM's high-throughput inference capabilities with Apple's native MLX framework, addressing the pain points of Apple Silicon users running large models locally. It supports multimodal processing of text, images, video, and audio, is compatible with OpenAI and Anthropic APIs, and allows Mac users to run large models locally at a generation speed of over 400 tokens per second, achieving an experience comparable to that of Linux+CUDA environments.

Section 02

Project Background: Pain Points and Solutions for Large Model Inference on Apple Silicon

Apple Silicon users have long faced a gap: mainstream inference frameworks such as vLLM depend on CUDA, leaving Apple's Metal GPUs sidelined. Standalone alternatives like mlx-lm exist, but they lack unified APIs and ecosystem integration. vllm-mlx follows a "compatibility first, performance foremost" principle, bridging the vLLM API and the MLX framework through an MLXPlatform plugin layer so that it inherits the strengths of both.

Section 03

Technical Architecture: Layered Design Balancing Compatibility and Performance

vllm-mlx uses a layered structure: the upper layer is vLLM's OpenAI/Anthropic-compatible API layer; the middle layer is the MLXPlatform plugin layer, which converts vLLM calls into MLX instructions; the bottom layer integrates libraries such as mlx-lm (LLM inference), mlx-vlm (multimodal), and mlx-audio (audio and video). The modular design supports independent evolution of components, while leveraging Apple's unified memory architecture to simplify memory management and reduce data copy overhead.
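The three layers described above can be sketched as a small routing example. This is an illustrative sketch only: the class and function names here (other than MLXPlatform, which the article names) are hypothetical stand-ins, not vllm-mlx's actual internals.

```python
# Sketch of the layered design: API layer -> MLXPlatform plugin layer -> MLX backends.
# All names except MLXPlatform are hypothetical illustrations.

class MLXPlatform:
    """Middle layer: translates framework-level calls into backend calls."""
    def __init__(self, backends):
        # Bottom-layer stand-ins, e.g. {"llm": mlx_lm_fn, "vlm": mlx_vlm_fn, ...}
        self.backends = backends

    def generate(self, modality, prompt):
        return self.backends[modality](prompt)

def serve_request(platform, request):
    """Top layer: an OpenAI-style request routed by modality."""
    modality = request.get("modality", "llm")
    text = platform.generate(modality, request["prompt"])
    return {"choices": [{"text": text}]}

# Stand-ins for mlx-lm (text) and mlx-vlm (multimodal)
platform = MLXPlatform({
    "llm": lambda p: f"[text completion for: {p}]",
    "vlm": lambda p: f"[image-grounded answer for: {p}]",
})

print(serve_request(platform, {"prompt": "hello", "modality": "llm"}))
```

The point of the middle layer is that the API surface never touches backend specifics, so mlx-lm, mlx-vlm, and mlx-audio can evolve independently behind it.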

Section 04

Performance Evidence: High Speed and Throughput on Apple Silicon

Benchmark tests show that on M4 Max, Llama-3.2-1B-4bit reaches 464 tokens/s, Llama-3.2-3B-4bit reaches 200 tokens/s, and Qwen3-0.6B-8bit reaches 402 tokens/s. With continuous batching, when there are 5 concurrent requests, the total throughput of Qwen3-0.6B increases from 328 tokens/s to 1112 tokens/s (3.4x), and Llama-3.2-1B increases from 299 tokens/s to 613 tokens/s (2x). For audio, Whisper-Tiny achieves a real-time factor of 197x, and Whisper-Large-V3 reaches 24x.
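The batching speedups quoted above follow directly from the raw throughput numbers; a quick check:

```python
# Reproduce the continuous-batching speedup ratios from the quoted figures.
single = {"Qwen3-0.6B": 328, "Llama-3.2-1B": 299}    # tokens/s, 1 request
batched = {"Qwen3-0.6B": 1112, "Llama-3.2-1B": 613}  # tokens/s, 5 concurrent

for model in single:
    speedup = batched[model] / single[model]
    print(f"{model}: {single[model]} -> {batched[model]} tok/s ({speedup:.1f}x)")
```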

Section 05

Multimodal Capabilities and API Compatibility: One-Stop Service and Seamless Migration

vllm-mlx supports text generation, image understanding (e.g., Qwen-VL), video analysis, speech synthesis (e.g., Kokoro, with support for 10+ languages), and speech recognition. Its APIs are compatible with the OpenAI Chat Completions and Anthropic Messages APIs, so developers can connect to the local service with the official SDKs unchanged. For example, setting ANTHROPIC_BASE_URL points Claude Code at a locally served model.
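Because the server speaks the OpenAI Chat Completions format, a request body is just the standard shape. The sketch below only constructs that body; the endpoint URL in the comment and the model name are illustrative, taken from the startup example later in this article.

```python
# Build a standard OpenAI Chat Completions request body.
# The local endpoint and model name are illustrative assumptions.
import json

def chat_completion_payload(model, user_message):
    """Standard Chat Completions body: model + message list."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

payload = chat_completion_payload(
    "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "Summarize MLX in one sentence.",
)
# This body would be POSTed to e.g. http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```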

Section 06

Advanced Features: Inference Parsing and Tool Call Extensions

vllm-mlx supports a reasoning parser. When enabled, API responses include a 'reasoning' field that separates the model's chain of thought from its final answer, which aids research and debugging. It also supports Anthropic's MCP tool-calling protocol, enabling integration with external tools (file systems, databases, etc.) for complex agent tasks.
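With the reasoning parser on, a client can read the chain of thought and the answer separately. The response shape below is a hypothetical example built around the 'reasoning' field the article describes; the surrounding structure follows the usual Chat Completions layout and is an assumption.

```python
# Hypothetical response with the reasoning parser enabled: the 'reasoning'
# field carries the chain of thought, 'content' carries the final answer.
import json

response_json = """
{
  "choices": [{
    "message": {
      "reasoning": "First convert 6 minutes to seconds: 6 * 60 = 360.",
      "content": "The answer is 360 seconds."
    }
  }]
}
"""

msg = json.loads(response_json)["choices"][0]["message"]
chain_of_thought, answer = msg["reasoning"], msg["content"]
print("reasoning:", chain_of_thought)
print("answer:", answer)
```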

Section 07

Deployment and Scenarios: Flexible Installation and Multi-Scenario Adaptation

Installation is flexible, available via uv tool install or pip. Example startup command: vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching. It is suitable for scenarios such as personal local services (privacy protection), small team shared nodes, and researchers' model experiments.
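The install and startup commands above, laid out as a session (the model name and port are the article's own example):

```shell
# Install via uv (or: pip install vllm-mlx)
uv tool install vllm-mlx

# Start an OpenAI-compatible server with continuous batching enabled
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit \
    --port 8000 --continuous-batching
```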

Section 08

Ecological Significance and Future: A Key Breakthrough for Apple Silicon AI Ecosystem

vllm-mlx breaks the stereotype that AI development requires NVIDIA GPUs and proves that Apple Silicon can handle large-model workloads. Going forward, it will track upstream vLLM updates and introduce more quantization strategies and optimizations, positioning it to replace cloud services in privacy-sensitive, offline, and cost-sensitive scenarios and to draw more Mac developers into large-model application development.