Zing Forum

vllm-swift: A High-Performance LLM Inference Engine for Apple Silicon

vllm-swift is a native backend based on Swift and Metal, providing high-performance inference capabilities for vLLM on Apple Silicon. It eliminates Python overhead in the inference hot path through pure Swift/Metal implementation, achieving up to 2.4x throughput improvement in low-concurrency scenarios.

Tags: vLLM, Apple Silicon, Swift, Metal, LLM inference, mlx-swift, KV cache compression, local deployment
Published 2026-04-24 00:42 · Recent activity 2026-04-24 00:51 · Estimated read: 6 min


Section 02

Project Background

With the rapid development of large language models (LLMs), demand for local inference keeps growing. Apple Silicon has become a popular platform for local LLM deployment thanks to its unified memory architecture and capable GPU. However, the traditional vLLM Metal backend still relies on Python and the MLX framework, which introduces significant overhead in the inference hot path. vllm-swift was created to eliminate these Python bottlenecks by moving inference entirely into Swift/Metal.


Section 03

Core Architecture

vllm-swift adopts a layered architecture design, completely moving Python out of the inference hot path:

  • Python Layer: Responsible only for vLLM API, tokenization, and scheduling coordination
  • C Bridge Layer: Enables communication between Python and Swift via ctypes FFI
  • Swift Layer: Core inference engine, implemented based on mlx-swift-lm
  • Metal GPU: Underlying computation acceleration

This architecture keeps the forward pass entirely in Swift/Metal, with Python used only for orchestration, which accounts for the significant performance gains.
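
The C bridge layer works the way any ctypes-based FFI does: the Python side loads the compiled native library, declares the C signatures it exports, and calls straight into native code with no interpreter in the loop. The actual symbols exported by vllm-swift's bridge are not documented here, so as a minimal analogy, this sketch calls into the system math library exactly the way the Python layer would call into the Swift engine:

```python
import ctypes
import ctypes.util

# Load a native library by name; vllm-swift's Python layer would load its
# Swift bridge dylib the same way (the real library name is not shown here).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes marshals arguments correctly; the same
# declaration step is needed for every function a Swift/C bridge exports.
libm.pow.restype = ctypes.c_double
libm.pow.argtypes = [ctypes.c_double, ctypes.c_double]

result = libm.pow(2.0, 10.0)
print(result)  # prints 1024.0
```

Once the signatures are declared, each call crosses the FFI boundary directly, which is what lets the hot path avoid Python-level dispatch.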


Section 04

Performance Advantages

According to official benchmark tests, vllm-swift performs particularly well in low-concurrency scenarios:


Section 05

Short Context Decoding Performance (Prompt=18 tokens, Generation=50 tokens)

Concurrency | vllm-swift | vllm-metal (Python/MLX) | Speedup
1 | 340 tok/s | 142 tok/s | 2.4x
8 | 1,512 tok/s | 1,170 tok/s | 1.3x
32 | 2,862 tok/s | 2,457 tok/s | 1.16x
64 | 3,383 tok/s | 3,017 tok/s | 1.12x

Section 06

Long Context Decoding Performance

Concurrency | vllm-swift | vllm-metal (Python/MLX)
1 | 149 tok/s | 105 tok/s
64 | 1,519 tok/s | 1,387 tok/s

The data make clear that vllm-swift's advantage is most pronounced at low concurrency, which is the typical regime for individual users and small-to-medium deployments.
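
The speedup column follows directly from the two throughput columns. A quick check on the short-context numbers shows how the advantage shrinks as concurrency rises and GPU utilization, rather than Python overhead, becomes the bottleneck:

```python
# Short-context decode throughput (tok/s) from the benchmark table above:
# concurrency -> (vllm-swift, vllm-metal Python/MLX)
short_context = {
    1: (340, 142),
    8: (1512, 1170),
    32: (2862, 2457),
    64: (3383, 3017),
}

for concurrency, (swift_tps, mlx_tps) in short_context.items():
    speedup = swift_tps / mlx_tps
    print(f"concurrency {concurrency:>2}: {speedup:.2f}x")
```

At concurrency 1 the ratio is about 2.4x; by concurrency 64 it has fallen to roughly 1.12x, matching the table.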


Section 07

TurboQuant+ KV Cache Compression

vllm-swift integrates TurboQuant+, which compresses the KV cache 3-5x while keeping model quality nearly lossless:

Scheme | Compression Ratio | 1K PPL | 32K PPL | Use Case
FP16 | 1.0x | 2.72 | 4.40 | Baseline
turbo4v2 | 3.2x | 3.22 | 3.72 | Quality/compression balance
turbo3 | 4.6x | 3.95 | 3.89 | Maximum compression, long contexts

After enabling KV cache compression, users can run longer context windows on Apple Silicon devices without significantly affecting inference speed.
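
To see what a 3.2x compression ratio buys in practice, the standard KV-cache sizing formula is enough. The model dimensions below (a 32-layer model with 8 KV heads of dimension 128, roughly a Llama-3-8B-class model with grouped-query attention) are illustrative assumptions, not vllm-swift specifics:

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store layers * kv_heads * head_dim values per token,
    # hence the leading factor of 2; bytes_per_elem=2 corresponds to FP16.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 32_768
fp16 = kv_cache_bytes(seq_len)   # FP16 baseline
turbo4v2 = fp16 / 3.2            # with 3.2x TurboQuant+ compression
print(f"FP16:     {fp16 / 2**30:.2f} GiB")      # 4.00 GiB
print(f"turbo4v2: {turbo4v2 / 2**30:.2f} GiB")  # 1.25 GiB
```

Under these assumptions a 32K-token cache drops from 4 GiB to 1.25 GiB, which is the headroom that lets longer context windows fit in the unified memory of an Apple Silicon machine.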


Section 08

Key Features

vllm-swift provides a complete OpenAI-compatible API, including:

  • OpenAI-compatible Interface: Supports /v1/completions and /v1/chat/completions endpoints
  • Streaming Response: Supports SSE streaming output
  • Chat Template: Automatically applies model-specific chat templates
  • Batch Decoding: Implements fully batched projection and attention computation via BatchedKVCache
  • Temperature Sampling: Supports per-request temperature sampling in the batch path
  • Automatic Model Download: Supports automatic model downloading from the Hugging Face Hub
  • Tool Calling: Supports enabling automatic tool selection via --enable-auto-tool-choice
  • VLM Support: Experimental support for Vision-Language Models
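
Because the server exposes the standard OpenAI endpoints, any OpenAI-compatible client can talk to it. The sketch below builds a streaming /v1/chat/completions request body and parses SSE lines the way a minimal client would; the model name is a placeholder, and the payload and chunk shapes follow the OpenAI Chat Completions specification rather than anything vllm-swift-specific:

```python
import json

# Request body for the OpenAI-compatible endpoint; the model name here
# is a placeholder, not one vllm-swift ships with.
payload = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "stream": True,  # request SSE streaming output
}

def parse_sse_line(line):
    """Extract the delta text from one SSE 'data:' line, or None."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data.strip() == "[DONE]":  # end-of-stream sentinel
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

# Example SSE line in the shape an OpenAI-compatible server emits:
line = 'data: {"choices": [{"delta": {"content": "Hi"}}]}'
print(parse_sse_line(line))  # prints Hi
```

With a server running locally, POSTing this payload to the /v1/chat/completions endpoint and feeding each response line through the parser yields the streamed tokens as they are generated.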