# Complete Local AI Inference Stack on Apple Silicon: Low-Latency Multimodal Inference with oMLX and asr-router

> This article introduces a local AI inference solution based on Apple Silicon and the MLX framework, covering large language models, speech recognition, embedding vectors, OCR, and multimodal visual understanding. It achieves efficient inference and real-time transcription through a dual-service architecture.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-07T04:13:37.000Z
- Last activity: 2026-05-07T04:21:36.698Z
- Popularity: 163.9
- Keywords: MLX, Apple Silicon, local inference, speech recognition, multimodal AI, large language models, oMLX, SenseVoice, gemma-4, Qwen
- Page link: https://www.zingnex.cn/en/forum/thread/apple-siliconai-omlxasr-router
- Canonical: https://www.zingnex.cn/forum/thread/apple-siliconai-omlxasr-router
- Markdown source: floors_fallback

---

## Main Floor: Local AI Inference Stack on Apple Silicon—Low-Latency Multimodal Inference via oMLX and asr-router Dual Services

This project introduces a complete local AI inference stack based on Apple Silicon and the MLX framework. Through the dual-service architecture of oMLX gateway and asr-router, it delivers full-featured AI capabilities including large language models, speech recognition, embedding vectors, OCR, and multimodal visual understanding. This solution supports low-latency inference and real-time transcription, provides OpenAI-compatible REST APIs to reduce developer migration costs, and fully leverages the hardware advantages of Apple Silicon.

## Background: Demand for Local AI Inference and Opportunities with Apple Silicon

With the rapid development of large language models and multimodal AI, efficiently running these models on local devices has become a focus for developers. Apple Silicon's unified memory architecture and the MLX framework provide the hardware and software foundation for local inference. This project aims to build a local inference stack covering multimodal capabilities, addressing the efficiency and resource optimization issues of running AI models locally.

## Core Architecture: Design Philosophy of Dual-Service Collaboration

The project's core adopts a dual-service collaboration design: the oMLX gateway serves as the main inference engine, responsible for LLM, vision-language models, embedding models, and OCR tasks; asr-router acts as a FastAPI sidecar service, focusing on speech recognition (real-time transcription, meeting scenario processing). The advantages lie in resource isolation and task optimization. Both services expose OpenAI-compatible REST APIs, making it easy for developers to call using familiar client libraries.
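Since both services speak the OpenAI REST dialect, a client only needs to know which port owns which task. The sketch below maps task names to endpoints; the ports (18080 for oMLX, 18081 for asr-router) come from the deployment section of this article, while the endpoint paths are assumptions based on the standard OpenAI API convention.

```python
# Hypothetical task-to-endpoint map for the dual-service setup; paths follow
# the OpenAI API convention, which both services are said to be compatible with.

OMLX_BASE = "http://localhost:18080/v1"        # LLM / vision / embeddings / OCR
ASR_ROUTER_BASE = "http://localhost:18081/v1"  # speech recognition sidecar

ENDPOINTS = {
    "chat": (OMLX_BASE, "/chat/completions"),
    "embeddings": (OMLX_BASE, "/embeddings"),
    "transcription": (ASR_ROUTER_BASE, "/audio/transcriptions"),
}

def endpoint_url(task: str) -> str:
    """Return the full URL for an OpenAI-style task name."""
    base, path = ENDPOINTS[task]
    return base + path
```

With a map like this, an OpenAI-compatible client library can be pointed at either base URL without any other code changes.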

## oMLX Gateway: Core Engine for Multimodal Inference

The oMLX gateway supports multiple model types: language models range from Qwen3.5-9B (5.8GB) to Qwen3.5-35B-A3B MoE (18GB) and gemma-4-26b (14GB); visual understanding is provided by the supergemma4-26b multimodal model; OCR uses PaddleOCR-VL-1.5; embedding services are based on Qwen3-Embedding-0.6B. In addition, it implements continuous batching and SSD caching mechanisms: when idle, KV caches are swapped out to SSD to free up memory, and quickly restored when requests come in, balancing response speed and resource utilization.
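The idle-swap behavior described above can be sketched as a small cache store: keep KV caches in memory while a session is active, spill them to SSD when idle, and reload on the next request. The class and method names here are illustrative, not oMLX's actual API.

```python
import pickle
import tempfile
from pathlib import Path

class KVCacheStore:
    """Minimal sketch of SSD cache spilling: hot caches live in memory,
    idle ones are serialized to disk and restored on demand."""

    def __init__(self, spill_dir=None):
        self.spill_dir = Path(spill_dir or tempfile.mkdtemp(prefix="kv_"))
        self.hot = {}  # session_id -> in-memory KV cache

    def swap_out(self, session_id: str) -> None:
        """Move an idle session's cache from memory to SSD, freeing RAM."""
        cache = self.hot.pop(session_id)
        path = self.spill_dir / f"{session_id}.pkl"
        path.write_bytes(pickle.dumps(cache))

    def restore(self, session_id: str):
        """Reload a spilled cache when a new request arrives for the session."""
        path = self.spill_dir / f"{session_id}.pkl"
        cache = pickle.loads(path.read_bytes())
        self.hot[session_id] = cache
        path.unlink()  # drop the on-disk copy once it is hot again
        return cache
```

The trade-off this models is the one the article describes: restoring from SSD is slower than a warm cache hit, but far cheaper than recomputing the prefill from scratch.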

## asr-router: Intelligent Speech Routing and Meeting Pipeline

asr-router provides two working modes:
1. IM Mode: Short audio (≤30 seconds) uses the SenseVoice model (228MB, int8-quantized), achieving 60-90 ms decode latency (RTF ≈ 0.01); long audio or high-quality requests are forwarded to the Qwen3-ASR-1.7B model in oMLX.
2. Meeting Mode: The asynchronous pipeline includes VAD + speaker separation → SenseVoice transcription and annotation → gemma-4 context review (correcting terminology/cross-language homophones) → generating five outputs: original transcription, reviewed Markdown, timeline JSON, SRT subtitles, and meeting summary.
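The IM-mode routing rule above reduces to a simple duration and quality check. The sketch below is illustrative; the function name and model identifiers are assumptions, but the 30-second threshold and the two backends are as described in this article.

```python
def route_im_request(duration_s: float, high_quality: bool = False) -> str:
    """Sketch of IM-mode routing: clips of 30 s or less go to the local
    int8 SenseVoice model; longer clips, or requests flagged as
    high-quality, are forwarded to Qwen3-ASR-1.7B in oMLX."""
    if duration_s <= 30 and not high_quality:
        return "sensevoice-int8"
    return "qwen3-asr-1.7b"
```

Keeping this decision in the sidecar means the low-latency path never touches the heavier gateway at all.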

## Performance Evidence: Significant Improvement from Context Review

Performance evaluation of gemma-4's context review function shows that in tests on real bilingual meeting audio, the character error rate (CER) decreased from 32.08% to 22.64% when the terminology table was applied, a relative improvement of 29.4%. This result demonstrates the value of the multi-round review mechanism for transcription quality in professional-terminology and multilingual scenarios.
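The 29.4% figure is the relative error reduction implied by the two CER values quoted above, which can be checked directly:

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative error reduction: (before - after) / before."""
    return (before - after) / before

# CER figures from the evaluation above: 32.08% without the terminology
# table, 22.64% with it.
gain = relative_improvement(32.08, 22.64)
print(f"{gain:.1%}")  # 29.4%
```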

## Deployment Practice and Hardware Requirements

Deployment and usage:
- oMLX is installed via Homebrew, and asr-router runs as a launchd agent (auto-start on boot, automatic restart after crashes);
- the two services listen on ports 18080 and 18081 respectively, authenticated with a shared API key;
- calls follow the same conventions as the cloud OpenAI API, so existing Python client libraries integrate without changes.

Hardware requirements:
- baseline 16GB of memory (32GB recommended);
- the 35B MoE model requires at least 24GB of free memory;
- SenseVoice and the embedding model occupy about 1.3GB while resident in the background.
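A rough memory-budget check can be derived from the figures above. The model sizes and the 1.3GB background figure are those quoted in this article; the 4GB OS headroom is an assumption added for illustration, so treat this as a sketch rather than an official sizing tool.

```python
BACKGROUND_GB = 1.3  # SenseVoice + Qwen3-Embedding-0.6B resident in background

# Approximate free memory each model needs, per the figures in this article.
MODEL_FREE_GB = {
    "qwen3.5-9b": 5.8,
    "gemma-4-26b": 14.0,
    "qwen3.5-35b-a3b": 24.0,  # MoE model: at least 24 GB free
}

def fits(total_ram_gb: float, model: str, headroom_gb: float = 4.0) -> bool:
    """True if the model, background services, and OS headroom fit in RAM.
    The headroom value is an illustrative assumption, not from the article."""
    return MODEL_FREE_GB[model] + BACKGROUND_GB + headroom_gb <= total_ram_gb
```

By this estimate, a 16GB machine comfortably runs the 9B model, while the 35B MoE model needs the 32GB configuration, which matches the recommendation above.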

## Conclusion: Future of Local AI and Potential of Apple Silicon

This project demonstrates the strong potential of Apple Silicon for local AI inference. Through a dual-service architecture, intelligent task routing, and efficient resource management, a single Mac can host a fully functional AI application stack. As the MLX ecosystem matures and model quantization advances, more such solutions will emerge, making powerful AI capabilities genuinely accessible on local hardware.
