Complete Local AI Inference Stack on Apple Silicon: Low-Latency Multimodal Inference with oMLX and asr-router

This article introduces a local AI inference solution based on Apple Silicon and the MLX framework, covering large language models, speech recognition, embedding vectors, OCR, and multimodal visual understanding. It achieves efficient inference and real-time transcription through a dual-service architecture.

Tags: MLX · Apple Silicon · Local Inference · Speech Recognition · Multimodal AI · Large Language Models · oMLX · SenseVoice · gemma-4 · Qwen
Published 2026-05-07 12:13 · Recent activity 2026-05-07 12:21 · Estimated read 7 min

Section 01

Main Floor: Local AI Inference Stack on Apple Silicon—Low-Latency Multimodal Inference via oMLX and asr-router Dual Services

This project introduces a complete local AI inference stack based on Apple Silicon and the MLX framework. Through the dual-service architecture of oMLX gateway and asr-router, it delivers full-featured AI capabilities including large language models, speech recognition, embedding vectors, OCR, and multimodal visual understanding. This solution supports low-latency inference and real-time transcription, provides OpenAI-compatible REST APIs to reduce developer migration costs, and fully leverages the hardware advantages of Apple Silicon.


Section 02

Background: Demand for Local AI Inference and Opportunities with Apple Silicon

With the rapid development of large language models and multimodal AI, efficiently running these models on local devices has become a focus for developers. Apple Silicon's unified memory architecture and the MLX framework provide the hardware and software foundation for local inference. This project aims to build a local inference stack covering multimodal capabilities, addressing the efficiency and resource optimization issues of running AI models locally.


Section 03

Core Architecture: Design Philosophy of Dual-Service Collaboration

The project's core adopts a dual-service collaboration design: the oMLX gateway serves as the main inference engine, responsible for LLM, vision-language models, embedding models, and OCR tasks; asr-router acts as a FastAPI sidecar service, focusing on speech recognition (real-time transcription, meeting scenario processing). The advantages lie in resource isolation and task optimization. Both services expose OpenAI-compatible REST APIs, making it easy for developers to call using familiar client libraries.
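
To make the dual-endpoint design concrete, here is a minimal sketch using the openai Python client against both services. It assumes oMLX listens on port 18080 and asr-router on 18081 (matching the deployment section below); the model ids are illustrative guesses, not confirmed names.

```python
# Minimal sketch: one OpenAI-compatible client per local service.
# Ports follow the deployment section; model ids are illustrative guesses.
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:18080/v1", api_key="YOUR_LOCAL_KEY")  # oMLX gateway
asr = OpenAI(base_url="http://localhost:18081/v1", api_key="YOUR_LOCAL_KEY")  # asr-router sidecar

# Chat completion handled by the oMLX gateway.
reply = llm.chat.completions.create(
    model="qwen3.5-9b",  # hypothetical id; query llm.models.list() for the real ones
    messages=[{"role": "user", "content": "Summarize this meeting in one sentence."}],
)
print(reply.choices[0].message.content)

# Short-audio transcription handled by asr-router.
with open("memo.wav", "rb") as f:
    result = asr.audio.transcriptions.create(model="sensevoice", file=f)  # hypothetical id
print(result.text)
```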


Section 04

oMLX Gateway: Core Engine for Multimodal Inference

The oMLX gateway supports multiple model types: language models range from Qwen3.5-9B (5.8GB) to Qwen3.5-35B-A3B MoE (18GB) and gemma-4-26b (14GB); visual understanding is provided by the supergemma4-26b multimodal model; OCR uses PaddleOCR-VL-1.5; embedding services are based on Qwen3-Embedding-0.6B. In addition, it implements continuous batching and SSD caching mechanisms: when idle, KV caches are swapped out to SSD to free up memory, and quickly restored when requests come in, balancing response speed and resource utilization.
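
The post does not detail how the SSD cache works internally. As a rough mental model only (the names, storage format, and eviction policy below are my assumptions, not oMLX code), the sketch swaps idle sessions' KV tensors out to disk and restores them lazily on the next request:

```python
# Rough mental model of idle-time KV-cache swapping; NOT oMLX's actual internals.
import pathlib
import time

import numpy as np


class KVCacheSpiller:
    def __init__(self, cache_dir: str = "/tmp/kv_cache", idle_seconds: float = 30.0):
        self.dir = pathlib.Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.idle_seconds = idle_seconds
        self.hot: dict[str, np.ndarray] = {}   # session id -> KV tensor in unified memory
        self.last_used: dict[str, float] = {}  # session id -> last request timestamp

    def put(self, session_id: str, kv: np.ndarray) -> None:
        self.hot[session_id] = kv
        self.last_used[session_id] = time.monotonic()

    def get(self, session_id: str) -> np.ndarray:
        if session_id not in self.hot:  # swapped out earlier: restore from SSD
            self.hot[session_id] = np.load(self.dir / f"{session_id}.npy")
        self.last_used[session_id] = time.monotonic()
        return self.hot[session_id]

    def sweep(self) -> None:
        """Swap sessions idle past the threshold out to SSD, freeing memory."""
        now = time.monotonic()
        for sid, t in list(self.last_used.items()):
            if sid in self.hot and now - t > self.idle_seconds:
                np.save(self.dir / f"{sid}.npy", self.hot.pop(sid))
```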


Section 05

asr-router: Intelligent Speech Routing and Meeting Pipeline

asr-router provides two working modes:

  1. IM Mode: Short audio (≤30 seconds) is decoded locally by the SenseVoice model (228MB, int8-quantized) with 60-90ms latency (RTF ≈ 0.01); long audio or high-quality requests are forwarded to the Qwen3-ASR-1.7B model in oMLX (see the routing sketch after this list).
  2. Meeting Mode: The asynchronous pipeline includes VAD + speaker separation → SenseVoice transcription and annotation → gemma-4 context review (correcting terminology/cross-language homophones) → generating five outputs: original transcription, reviewed Markdown, timeline JSON, SRT subtitles, and meeting summary.
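
A hedged sketch of the IM-mode routing decision: the 30-second threshold comes from the post, while the exact heuristics and identifiers are assumptions.

```python
# Hedged sketch of IM-mode routing; threshold from the post, identifiers assumed.
from dataclasses import dataclass

IM_MAX_SECONDS = 30.0  # the post routes clips of <= 30 s to local SenseVoice


@dataclass
class Route:
    backend: str  # "local-sensevoice" or "omlx"
    model: str


def route_im_request(duration_s: float, high_quality: bool = False) -> Route:
    """Short, ordinary clips stay on SenseVoice; the rest go to Qwen3-ASR in oMLX."""
    if duration_s <= IM_MAX_SECONDS and not high_quality:
        return Route(backend="local-sensevoice", model="SenseVoice-int8")
    return Route(backend="omlx", model="Qwen3-ASR-1.7B")
```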

Section 06

Performance Evidence: Significant Improvement from Context Review

Evaluation of gemma-4's context-review stage on real bilingual meeting audio shows the Character Error Rate (CER) dropping from 32.08% to 22.64% once the terminology table is applied, a relative improvement of 29.4%. This demonstrates the value of the multi-round review mechanism for transcription quality in professional-terminology and multilingual scenarios.
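
For context, CER is edit distance divided by reference length, and the reported relative-improvement figure checks out from the two rates. The snippet below is a generic Levenshtein-based CER, not the author's evaluation script:

```python
# Generic CER via Levenshtein distance; not the author's evaluation script.
def cer(ref: str, hyp: str) -> float:
    """Character error rate = (substitutions + deletions + insertions) / len(ref)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

# Reproducing the relative improvement from the reported rates:
before, after = 0.3208, 0.2264
print(f"relative improvement: {(before - after) / before:.1%}")  # -> 29.4%
```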


Section 07

Deployment Practice and Hardware Requirements

Deployment and Usage:

  • oMLX is installed via Homebrew, and asr-router runs as a launchd agent (supporting auto-start on boot and automatic recovery after crashes);
  • The two services listen on ports 18080 and 18081 respectively, authenticated with a unified API key (see the smoke test below);
  • The calling method is consistent with the cloud OpenAI API, so Python client libraries integrate seamlessly.

Hardware requirements: 16GB of memory as the baseline (32GB recommended); the 35B MoE model requires at least 24GB of free memory; SenseVoice and the embedding models occupy about 1.3GB of memory when running in the background.
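
As a quick post-deployment check, the hypothetical smoke test below lists the models each service exposes, assuming both implement the standard OpenAI-compatible /v1/models endpoint; substitute your own key for the placeholder.

```python
# Hypothetical smoke test; assumes both services expose OpenAI-style /v1/models.
import json
import urllib.request

API_KEY = "YOUR_LOCAL_KEY"
for name, port in [("oMLX", 18080), ("asr-router", 18081)]:
    req = urllib.request.Request(
        f"http://localhost:{port}/v1/models",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        ids = [m["id"] for m in json.load(resp).get("data", [])]
    print(f"{name} on :{port} serves: {ids}")
```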

Section 08

Conclusion: Future of Local AI and Potential of Apple Silicon

This project demonstrates the great potential of Apple Silicon in the field of local AI inference. Through dual-service architecture, intelligent task routing, and efficient resource management, a single Mac can build a fully functional AI application. With the development of the MLX ecosystem and advances in model quantization technology, more innovative solutions will emerge in the future, making powerful AI capabilities truly accessible.