Zing Forum

vllm-swift: A High-Performance LLM Inference Engine for Apple Silicon

vllm-swift is a native backend based on Swift and Metal, providing high-performance inference capabilities for vLLM on Apple Silicon. It eliminates Python overhead in the inference hot path through pure Swift/Metal implementation, achieving up to 2.4x throughput improvement in low-concurrency scenarios.

Tags: vLLM, Apple Silicon, Swift, Metal, LLM inference, mlx-swift, KV cache compression, local deployment
Published 2026-04-24 00:42 · Recent activity 2026-04-24 00:51 · Estimated read: 6 min


Section 02

Project Background

With the rapid development of large language models (LLMs), demand for local inference keeps growing. Apple Silicon has become a popular platform for local LLM deployment thanks to its unified memory architecture and capable GPU. However, the traditional vLLM Metal backend still relies on Python and the MLX framework, which introduces significant overhead in the inference hot path. vllm-swift was created to eliminate these Python bottlenecks by moving inference entirely into Swift/Metal.


Section 03

Core Architecture

vllm-swift adopts a layered architecture design, completely moving Python out of the inference hot path:

  • Python Layer: Responsible only for vLLM API, tokenization, and scheduling coordination
  • C Bridge Layer: Enables communication between Python and Swift via ctypes FFI
  • Swift Layer: Core inference engine, implemented based on mlx-swift-lm
  • Metal GPU: Underlying computation acceleration

This architecture keeps the forward pass entirely in Swift/Metal, with Python used only for orchestration, which accounts for the significant performance gains.
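
The C bridge layer works the way any ctypes-based FFI does: the Python side loads the compiled native library, declares the C signatures it exports, and calls straight into native code with no interpreter in the loop. The actual symbols exported by vllm-swift's bridge are not documented here, so as a minimal analogy, this sketch calls into the system math library exactly the way the Python layer would call into the Swift engine:

```python
import ctypes
import ctypes.util

# Load a native library by name; vllm-swift's Python layer would load its
# Swift bridge dylib the same way (the real library name is not shown here).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes marshals arguments correctly; the same
# declaration step is needed for every function a Swift/C bridge exports.
libm.pow.restype = ctypes.c_double
libm.pow.argtypes = [ctypes.c_double, ctypes.c_double]

result = libm.pow(2.0, 10.0)
print(result)  # prints 1024.0
```

Once the signatures are declared, each call crosses the FFI boundary directly, which is what lets the hot path avoid Python-level dispatch.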


Section 04

Performance Advantages

According to official benchmark tests, vllm-swift performs particularly well in low-concurrency scenarios:


Section 05

Short Context Decoding Performance (Prompt=18 tokens, Generation=50 tokens)

Concurrency | vllm-swift | vllm-metal (Python/MLX) | Speedup
1 | 340 tok/s | 142 tok/s | 2.4x
8 | 1,512 tok/s | 1,170 tok/s | 1.3x
32 | 2,862 tok/s | 2,457 tok/s | 1.16x
64 | 3,383 tok/s | 3,017 tok/s | 1.12x

Section 06

Long Context Decoding Performance

Concurrency | vllm-swift | vllm-metal (Python/MLX)
1 | 149 tok/s | 105 tok/s
64 | 1,519 tok/s | 1,387 tok/s

The data make clear that vllm-swift's advantage is most pronounced at low concurrency, which is the typical regime for individual users and small-to-medium deployments.
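
The speedup column follows directly from the two throughput columns. A quick check on the short-context numbers shows how the advantage shrinks as concurrency rises and GPU utilization, rather than Python overhead, becomes the bottleneck:

```python
# Short-context decode throughput (tok/s) from the benchmark table above:
# concurrency -> (vllm-swift, vllm-metal Python/MLX)
short_context = {
    1: (340, 142),
    8: (1512, 1170),
    32: (2862, 2457),
    64: (3383, 3017),
}

for concurrency, (swift_tps, mlx_tps) in short_context.items():
    speedup = swift_tps / mlx_tps
    print(f"concurrency {concurrency:>2}: {speedup:.2f}x")
```

At concurrency 1 the ratio is about 2.4x; by concurrency 64 it has fallen to roughly 1.12x, matching the table.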


Section 07

TurboQuant+ KV Cache Compression

vllm-swift integrates TurboQuant+, which compresses the KV cache 3-5x while keeping model quality nearly lossless:

Scheme | Compression Ratio | 1K PPL | 32K PPL | Use Case
FP16 | 1.0x | 2.72 | 4.40 | Baseline
turbo4v2 | 3.2x | 3.22 | 3.72 | Quality/compression balance
turbo3 | 4.6x | 3.95 | 3.89 | Maximum compression, long contexts

After enabling KV cache compression, users can run longer context windows on Apple Silicon devices without significantly affecting inference speed.
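
To see what a 3.2x compression ratio buys in practice, the standard KV-cache sizing formula is enough. The model dimensions below (a 32-layer model with 8 KV heads of dimension 128, roughly a Llama-3-8B-class model with grouped-query attention) are illustrative assumptions, not vllm-swift specifics:

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store layers * kv_heads * head_dim values per token,
    # hence the leading factor of 2; bytes_per_elem=2 corresponds to FP16.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 32_768
fp16 = kv_cache_bytes(seq_len)   # FP16 baseline
turbo4v2 = fp16 / 3.2            # with 3.2x TurboQuant+ compression
print(f"FP16:     {fp16 / 2**30:.2f} GiB")      # 4.00 GiB
print(f"turbo4v2: {turbo4v2 / 2**30:.2f} GiB")  # 1.25 GiB
```

Under these assumptions a 32K-token cache drops from 4 GiB to 1.25 GiB, which is the headroom that lets longer context windows fit in the unified memory of an Apple Silicon machine.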


Section 08

Key Features

vllm-swift provides a complete OpenAI-compatible API, including:

  • OpenAI-compatible Interface: Supports /v1/completions and /v1/chat/completions endpoints
  • Streaming Response: Supports SSE streaming output
  • Chat Template: Automatically applies model-specific chat templates
  • Batch Decoding: Implements fully batched projection and attention computation via BatchedKVCache
  • Temperature Sampling: Supports per-request temperature sampling in the batch path
  • Automatic Model Download: Supports automatic model downloading from the Hugging Face Hub
  • Tool Calling: Supports enabling automatic tool selection via --enable-auto-tool-choice
  • VLM Support: Experimental support for Vision-Language Models
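
Because the server exposes the standard OpenAI endpoints, any OpenAI-compatible client can talk to it. The sketch below builds a streaming /v1/chat/completions request body and parses SSE lines the way a minimal client would; the model name is a placeholder, and the payload and chunk shapes follow the OpenAI Chat Completions specification rather than anything vllm-swift-specific:

```python
import json

# Request body for the OpenAI-compatible endpoint; the model name here
# is a placeholder, not one vllm-swift ships with.
payload = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "stream": True,  # request SSE streaming output
}

def parse_sse_line(line):
    """Extract the delta text from one SSE 'data:' line, or None."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data.strip() == "[DONE]":  # end-of-stream sentinel
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

# Example SSE line in the shape an OpenAI-compatible server emits:
line = 'data: {"choices": [{"delta": {"content": "Hi"}}]}'
print(parse_sse_line(line))  # prints Hi
```

With a server running locally, POSTing this payload to the /v1/chat/completions endpoint and feeding each response line through the parser yields the streamed tokens as they are generated.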