Zing Forum

Reading

GenMLX: Building Large Model Inference Clusters with Multiple Apple Silicon Macs

GenMLX is an open-source project that connects multiple Apple Silicon Macs (M-series) via Thunderbolt 5 network to form a tensor parallel inference cluster for running large-parameter language models. It supports Web UI management, OpenAI-compatible API, L2 disk cache, and heterogeneous memory configuration, with deployment achievable in 15 minutes.

Apple SiliconMLX大语言模型分布式推理Thunderbolt 5张量并行本地部署机器学习Mac开源
Published 2026-06-04 19:15Recent activity 2026-06-04 19:18Estimated read 6 min
GenMLX: Building Large Model Inference Clusters with Multiple Apple Silicon Macs
1

Section 01

GenMLX: Open-Source Project for Apple Silicon Macs to Build Large Model Inference Clusters

GenMLX is an open-source project that connects multiple Apple Silicon Macs (M-series) via Thunderbolt 5 to form a tensor parallel inference cluster for running large parameter language models. Key features include Web UI management, OpenAI-compatible API, L2 disk cache, heterogeneous memory configuration, and deployment in 15 minutes. It addresses the memory bottleneck of single Macs for large models.

2

Section 02

Background & Problem Solved

Traditional single-machine inference on Apple Silicon Macs is limited by unified memory capacity, making it hard to run models over 100B parameters. GenMLX, built on Apple's MLX framework, uses Thunderbolt5's high-speed network to create a distributed cluster, breaking this limit and allowing integration of multiple Mac devices (M1 Max, M3 Ultra, etc.) into a unified inference engine.

3

Section 03

Core Architecture & Technical Principles

Control Plane (Master-Agent): Master node manages Web UI, REST API, registry, and task scheduling; Agent runs on each worker node, responding to Master commands with HTTP + Bearer Token (no SSH keys needed).

Data Plane (Dispatcher): FastAPI-based core service wrapping mlx-lm, supporting continuous batching and L2 cache, using mx.distributed for node communication over Thunderbolt5.

Network Flexibility: Supports TB5 RDMA (best performance), TB4/3 RDMA, and 10/1 GbE as backup; mesh setup wizard auto-generates IP plans for 1-6 nodes (full mesh/ring topology).

4

Section 04

Key Functional Features

Heterogeneous Memory Support: Automatically chooses tensor parallel (homogeneous) or pipeline parallel (heterogeneous) for mixed Mac configs (e.g., 192GB Mac Studio +32GB Mac mini +96GB MacBook Pro).

L2 Disk Cache: 200GB+ SSD cache for KV state, reduces cold start prefill from 88 mins to 37 secs, saves snapshots at system prompt boundaries for reuse.

API Compatibility: OpenAI-compatible API (/v1/chat/completions etc.), native Anthropic API adapter, tool/function call support, thinking token routing.

5

Section 05

Deployment & Usage Experience

Quick Installation: Master node via curl | bash --master (installs Python3.11+uv+macmon, sets venv, generates token, launchd service, opens UI at localhost:6789). Worker node via curl | bash --agent with master URL and token (registers in 30 secs).

Web UI: Manages model lifecycle (download/sync/serve), checks model presence across nodes, real-time telemetry (CPU/GPU/RAM/SSD), config panel for tools like Claude Code.

6

Section 06

Performance & Limitations

Current State: Pre-alpha (v0.1.0.dev0, phase 0 of 7).

Hardware Reqs: Apple Silicon Macs, Thunderbolt5 recommended, 1-6 nodes.

Differences from Similar Projects: Focuses on fixed, owned topology (1-6 Macs on private network); EXO Labs is better for elastic/dynamic device discovery across mobile/desktop.

7

Section 07

Practical Significance & Application Scenarios

GenMLX solves scenarios like: 1. Privacy-first local inference (no API keys, data stays local). 2. Hardware asset reuse (integrate existing Macs).3. Local deployment of large models (DeepSeek V4, Qwen3-Coder-Next, GLM-4.7 etc.).4. Integration with tools (Claude Code, Cline, OpenWebUI).

8

Section 08

Conclusion & Future Outlook

GenMLX is an important attempt at distributed AI inference in the Apple Silicon ecosystem. It leverages Thunderbolt5 and MLX framework to enable local large model runs. Its architecture (control/data plane separation, heterogeneous support, API compatibility) caters to real deployment needs. As it matures (target v1.0.0), it's expected to become a top choice for Apple Silicon users to deploy local large models.