Complete Local AI Inference Stack on Apple Silicon: Low-Latency Multimodal Inference with oMLX and asr-router

This article introduces a local AI inference solution based on Apple Silicon and the MLX framework, covering large language models, speech recognition, embedding vectors, OCR, and multimodal visual understanding. It achieves efficient inference and real-time transcription through a dual-service architecture.

Tags: MLX · Apple Silicon · Local Inference · Speech Recognition · Multimodal AI · Large Language Models · oMLX · SenseVoice · gemma-4 · Qwen
Published 2026-05-07 12:13 · Recent activity 2026-05-07 12:21 · Estimated read 7 min

Section 01

Main Floor: Local AI Inference Stack on Apple Silicon—Low-Latency Multimodal Inference via oMLX and asr-router Dual Services

This project introduces a complete local AI inference stack based on Apple Silicon and the MLX framework. Through the dual-service architecture of oMLX gateway and asr-router, it delivers full-featured AI capabilities including large language models, speech recognition, embedding vectors, OCR, and multimodal visual understanding. This solution supports low-latency inference and real-time transcription, provides OpenAI-compatible REST APIs to reduce developer migration costs, and fully leverages the hardware advantages of Apple Silicon.


Section 02

Background: Demand for Local AI Inference and Opportunities with Apple Silicon

With the rapid development of large language models and multimodal AI, efficiently running these models on local devices has become a focus for developers. Apple Silicon's unified memory architecture and the MLX framework provide the hardware and software foundation for local inference. This project aims to build a local inference stack covering multimodal capabilities, addressing the efficiency and resource optimization issues of running AI models locally.


Section 03

Core Architecture: Design Philosophy of Dual-Service Collaboration

The project's core adopts a dual-service collaboration design: the oMLX gateway serves as the main inference engine, responsible for LLM, vision-language models, embedding models, and OCR tasks; asr-router acts as a FastAPI sidecar service, focusing on speech recognition (real-time transcription, meeting scenario processing). The advantages lie in resource isolation and task optimization. Both services expose OpenAI-compatible REST APIs, making it easy for developers to call using familiar client libraries.
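
To make the dual-endpoint design concrete, here is a minimal sketch using the openai Python client against both services. It assumes oMLX listens on port 18080 and asr-router on 18081 (matching the deployment section below); the model ids are illustrative guesses, not confirmed names.

```python
# Minimal sketch: one OpenAI-compatible client per local service.
# Ports follow the deployment section; model ids are illustrative guesses.
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:18080/v1", api_key="YOUR_LOCAL_KEY")  # oMLX gateway
asr = OpenAI(base_url="http://localhost:18081/v1", api_key="YOUR_LOCAL_KEY")  # asr-router sidecar

# Chat completion handled by the oMLX gateway.
reply = llm.chat.completions.create(
    model="qwen3.5-9b",  # hypothetical id; query llm.models.list() for the real ones
    messages=[{"role": "user", "content": "Summarize this meeting in one sentence."}],
)
print(reply.choices[0].message.content)

# Short-audio transcription handled by asr-router.
with open("memo.wav", "rb") as f:
    result = asr.audio.transcriptions.create(model="sensevoice", file=f)  # hypothetical id
print(result.text)
```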


Section 04

oMLX Gateway: Core Engine for Multimodal Inference

The oMLX gateway supports multiple model types: language models range from Qwen3.5-9B (5.8GB) to Qwen3.5-35B-A3B MoE (18GB) and gemma-4-26b (14GB); visual understanding is provided by the supergemma4-26b multimodal model; OCR uses PaddleOCR-VL-1.5; embedding services are based on Qwen3-Embedding-0.6B. In addition, it implements continuous batching and SSD caching mechanisms: when idle, KV caches are swapped out to SSD to free up memory, and quickly restored when requests come in, balancing response speed and resource utilization.
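
The post does not detail how the SSD cache works internally. As a rough mental model only (the names, storage format, and eviction policy below are my assumptions, not oMLX code), the sketch swaps idle sessions' KV tensors out to disk and restores them lazily on the next request:

```python
# Rough mental model of idle-time KV-cache swapping; NOT oMLX's actual internals.
import pathlib
import time

import numpy as np


class KVCacheSpiller:
    def __init__(self, cache_dir: str = "/tmp/kv_cache", idle_seconds: float = 30.0):
        self.dir = pathlib.Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.idle_seconds = idle_seconds
        self.hot: dict[str, np.ndarray] = {}   # session id -> KV tensor in unified memory
        self.last_used: dict[str, float] = {}  # session id -> last request timestamp

    def put(self, session_id: str, kv: np.ndarray) -> None:
        self.hot[session_id] = kv
        self.last_used[session_id] = time.monotonic()

    def get(self, session_id: str) -> np.ndarray:
        if session_id not in self.hot:  # swapped out earlier: restore from SSD
            self.hot[session_id] = np.load(self.dir / f"{session_id}.npy")
        self.last_used[session_id] = time.monotonic()
        return self.hot[session_id]

    def sweep(self) -> None:
        """Swap sessions idle past the threshold out to SSD, freeing memory."""
        now = time.monotonic()
        for sid, t in list(self.last_used.items()):
            if sid in self.hot and now - t > self.idle_seconds:
                np.save(self.dir / f"{sid}.npy", self.hot.pop(sid))
```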


Section 05

asr-router: Intelligent Speech Routing and Meeting Pipeline

asr-router provides two working modes:

  1. IM Mode: Short audio (≤30 seconds) is decoded locally by the SenseVoice model (228MB, int8-quantized) with 60-90ms latency (RTF ≈ 0.01); long audio or high-quality requests are forwarded to the Qwen3-ASR-1.7B model in oMLX (see the routing sketch after this list).
  2. Meeting Mode: The asynchronous pipeline includes VAD + speaker separation → SenseVoice transcription and annotation → gemma-4 context review (correcting terminology/cross-language homophones) → generating five outputs: original transcription, reviewed Markdown, timeline JSON, SRT subtitles, and meeting summary.
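
A hedged sketch of the IM-mode routing decision: the 30-second threshold comes from the post, while the exact heuristics and identifiers are assumptions.

```python
# Hedged sketch of IM-mode routing; threshold from the post, identifiers assumed.
from dataclasses import dataclass

IM_MAX_SECONDS = 30.0  # the post routes clips of <= 30 s to local SenseVoice


@dataclass
class Route:
    backend: str  # "local-sensevoice" or "omlx"
    model: str


def route_im_request(duration_s: float, high_quality: bool = False) -> Route:
    """Short, ordinary clips stay on SenseVoice; the rest go to Qwen3-ASR in oMLX."""
    if duration_s <= IM_MAX_SECONDS and not high_quality:
        return Route(backend="local-sensevoice", model="SenseVoice-int8")
    return Route(backend="omlx", model="Qwen3-ASR-1.7B")
```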

Section 06

Performance Evidence: Significant Improvement from Context Review

Evaluation of gemma-4's context-review stage on real bilingual meeting audio shows the Character Error Rate (CER) dropping from 32.08% to 22.64% once the terminology table is applied, a relative improvement of 29.4%. This demonstrates the value of the multi-round review mechanism for transcription quality in professional-terminology and multilingual scenarios.
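
For context, CER is edit distance divided by reference length, and the reported relative-improvement figure checks out from the two rates. The snippet below is a generic Levenshtein-based CER, not the author's evaluation script:

```python
# Generic CER via Levenshtein distance; not the author's evaluation script.
def cer(ref: str, hyp: str) -> float:
    """Character error rate = (substitutions + deletions + insertions) / len(ref)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

# Reproducing the relative improvement from the reported rates:
before, after = 0.3208, 0.2264
print(f"relative improvement: {(before - after) / before:.1%}")  # -> 29.4%
```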


Section 07

Deployment Practice and Hardware Requirements

Deployment and Usage:

  • oMLX is installed via Homebrew, and asr-router runs as a launchd agent (supporting auto-start on boot and automatic recovery after crashes);
  • The two services listen on ports 18080 and 18081 respectively, authenticated with a unified API key (see the smoke test below);
  • The calling method is consistent with the cloud OpenAI API, so Python client libraries integrate seamlessly.

Hardware requirements: 16GB of memory as the baseline (32GB recommended); the 35B MoE model requires at least 24GB of free memory; SenseVoice and the embedding models occupy about 1.3GB of memory when running in the background.
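
As a quick post-deployment check, the hypothetical smoke test below lists the models each service exposes, assuming both implement the standard OpenAI-compatible /v1/models endpoint; substitute your own key for the placeholder.

```python
# Hypothetical smoke test; assumes both services expose OpenAI-style /v1/models.
import json
import urllib.request

API_KEY = "YOUR_LOCAL_KEY"
for name, port in [("oMLX", 18080), ("asr-router", 18081)]:
    req = urllib.request.Request(
        f"http://localhost:{port}/v1/models",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        ids = [m["id"] for m in json.load(resp).get("data", [])]
    print(f"{name} on :{port} serves: {ids}")
```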

Section 08

Conclusion: Future of Local AI and Potential of Apple Silicon

This project demonstrates the great potential of Apple Silicon in the field of local AI inference. Through dual-service architecture, intelligent task routing, and efficient resource management, a single Mac can build a fully functional AI application. With the development of the MLX ecosystem and advances in model quantization technology, more innovative solutions will emerge in the future, making powerful AI capabilities truly accessible.