Zing Forum

vllm-mlx: Natively Run High-Performance Large Language Models and Multimodal Inference on Apple Silicon

vllm-mlx brings vLLM's high-throughput inference capabilities to Apple Silicon, enabling native GPU acceleration via the MLX framework. It supports multimodal processing of text, images, video, and audio, is compatible with OpenAI and Anthropic APIs, and allows Mac users to run large models locally at a generation speed of over 400 tokens per second.

Tags: vllm-mlx, Apple Silicon, MLX, large language models, local inference, multimodal, OpenAI API, Claude Code, MCP, speech synthesis
Published 2026-04-01 00:45 · Recent activity 2026-04-01 01:21 · Estimated read 6 min

Section 01

vllm-mlx: A High-Performance Multimodal Inference Solution for Apple Silicon

The vllm-mlx project integrates vLLM's high-throughput inference capabilities with Apple's native MLX framework, addressing the pain points of Apple Silicon users running large models locally. It supports multimodal processing of text, images, video, and audio, is compatible with OpenAI and Anthropic APIs, and allows Mac users to run large models locally at a generation speed of over 400 tokens per second, achieving an experience comparable to that of Linux+CUDA environments.

Section 02

Project Background: Pain Points and Solutions for Large Model Inference on Apple Silicon

Apple Silicon users have long faced a gap: mainstream inference frameworks such as vLLM depend on CUDA, leaving Apple's Metal GPUs sidelined. Standalone alternatives like mlx-lm exist, but they lack unified APIs and ecosystem integration. vllm-mlx follows a "compatibility first, performance foremost" principle, bridging the vLLM API and the MLX framework through an MLXPlatform plugin layer so that it inherits the strengths of both.

Section 03

Technical Architecture: Layered Design Balancing Compatibility and Performance

vllm-mlx uses a layered structure: the upper layer is vLLM's OpenAI/Anthropic-compatible API layer; the middle layer is the MLXPlatform plugin layer, which converts vLLM calls into MLX instructions; the bottom layer integrates libraries such as mlx-lm (LLM inference), mlx-vlm (multimodal), and mlx-audio (audio and video). The modular design supports independent evolution of components, while leveraging Apple's unified memory architecture to simplify memory management and reduce data copy overhead.
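The three layers described above can be sketched as a small routing example. This is an illustrative sketch only: the class and function names here (other than MLXPlatform, which the article names) are hypothetical stand-ins, not vllm-mlx's actual internals.

```python
# Sketch of the layered design: API layer -> MLXPlatform plugin layer -> MLX backends.
# All names except MLXPlatform are hypothetical illustrations.

class MLXPlatform:
    """Middle layer: translates framework-level calls into backend calls."""
    def __init__(self, backends):
        # Bottom-layer stand-ins, e.g. {"llm": mlx_lm_fn, "vlm": mlx_vlm_fn, ...}
        self.backends = backends

    def generate(self, modality, prompt):
        return self.backends[modality](prompt)

def serve_request(platform, request):
    """Top layer: an OpenAI-style request routed by modality."""
    modality = request.get("modality", "llm")
    text = platform.generate(modality, request["prompt"])
    return {"choices": [{"text": text}]}

# Stand-ins for mlx-lm (text) and mlx-vlm (multimodal)
platform = MLXPlatform({
    "llm": lambda p: f"[text completion for: {p}]",
    "vlm": lambda p: f"[image-grounded answer for: {p}]",
})

print(serve_request(platform, {"prompt": "hello", "modality": "llm"}))
```

The point of the middle layer is that the API surface never touches backend specifics, so mlx-lm, mlx-vlm, and mlx-audio can evolve independently behind it.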

Section 04

Performance Evidence: High Speed and Throughput on Apple Silicon

Benchmark tests show that on M4 Max, Llama-3.2-1B-4bit reaches 464 tokens/s, Llama-3.2-3B-4bit reaches 200 tokens/s, and Qwen3-0.6B-8bit reaches 402 tokens/s. With continuous batching, when there are 5 concurrent requests, the total throughput of Qwen3-0.6B increases from 328 tokens/s to 1112 tokens/s (3.4x), and Llama-3.2-1B increases from 299 tokens/s to 613 tokens/s (2x). For audio, Whisper-Tiny achieves a real-time factor of 197x, and Whisper-Large-V3 reaches 24x.
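The batching speedups quoted above follow directly from the raw throughput numbers; a quick check:

```python
# Reproduce the continuous-batching speedup ratios from the quoted figures.
single = {"Qwen3-0.6B": 328, "Llama-3.2-1B": 299}    # tokens/s, 1 request
batched = {"Qwen3-0.6B": 1112, "Llama-3.2-1B": 613}  # tokens/s, 5 concurrent

for model in single:
    speedup = batched[model] / single[model]
    print(f"{model}: {single[model]} -> {batched[model]} tok/s ({speedup:.1f}x)")
```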

Section 05

Multimodal Capabilities and API Compatibility: One-Stop Service and Seamless Migration

vllm-mlx supports text generation, image understanding (e.g., Qwen-VL), video analysis, speech synthesis (e.g., Kokoro, with support for 10+ languages), and speech recognition. Its APIs are compatible with the OpenAI Chat Completions and Anthropic Messages APIs, so developers can connect to the local service with the official SDKs unchanged. For example, setting ANTHROPIC_BASE_URL points Claude Code at a locally served model.
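Because the server speaks the OpenAI Chat Completions format, a request body is just the standard shape. The sketch below only constructs that body; the endpoint URL in the comment and the model name are illustrative, taken from the startup example later in this article.

```python
# Build a standard OpenAI Chat Completions request body.
# The local endpoint and model name are illustrative assumptions.
import json

def chat_completion_payload(model, user_message):
    """Standard Chat Completions body: model + message list."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

payload = chat_completion_payload(
    "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "Summarize MLX in one sentence.",
)
# This body would be POSTed to e.g. http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```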

Section 06

Advanced Features: Inference Parsing and Tool Call Extensions

vllm-mlx supports a reasoning parser. When enabled, API responses include a 'reasoning' field that separates the model's chain of thought from its final answer, which aids research and debugging. It also supports Anthropic's MCP tool-calling protocol, enabling integration with external tools (file systems, databases, etc.) for complex agent tasks.
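With the reasoning parser on, a client can read the chain of thought and the answer separately. The response shape below is a hypothetical example built around the 'reasoning' field the article describes; the surrounding structure follows the usual Chat Completions layout and is an assumption.

```python
# Hypothetical response with the reasoning parser enabled: the 'reasoning'
# field carries the chain of thought, 'content' carries the final answer.
import json

response_json = """
{
  "choices": [{
    "message": {
      "reasoning": "First convert 6 minutes to seconds: 6 * 60 = 360.",
      "content": "The answer is 360 seconds."
    }
  }]
}
"""

msg = json.loads(response_json)["choices"][0]["message"]
chain_of_thought, answer = msg["reasoning"], msg["content"]
print("reasoning:", chain_of_thought)
print("answer:", answer)
```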

Section 07

Deployment and Scenarios: Flexible Installation and Multi-Scenario Adaptation

Installation is flexible, available via uv tool install or pip. Example startup command: vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching. It is suitable for scenarios such as personal local services (privacy protection), small team shared nodes, and researchers' model experiments.
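The install and startup commands above, laid out as a session (the model name and port are the article's own example):

```shell
# Install via uv (or: pip install vllm-mlx)
uv tool install vllm-mlx

# Start an OpenAI-compatible server with continuous batching enabled
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit \
    --port 8000 --continuous-batching
```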

Section 08

Ecological Significance and Future: A Key Breakthrough for Apple Silicon AI Ecosystem

vllm-mlx breaks the stereotype that AI development requires NVIDIA GPUs and proves that Apple Silicon can handle large-model workloads. Going forward, it will track upstream vLLM updates and introduce more quantization strategies and optimizations, positioning it to replace cloud services in privacy-sensitive, offline, and cost-sensitive scenarios and to draw more Mac developers into large-model application development.