Zing Forum

Reading

GenMLX: Build an LLM Inference Cluster with Multiple Apple Silicon Macs

GenMLX is an open-source project that allows users to connect multiple Apple Silicon Macs (M-series chips) via Thunderbolt 5 network to form a tensor-parallel inference cluster for running large language models locally.

Apple SiliconMLX分布式推理LLMThunderbolt本地部署张量并行集群
Published 2026-06-04 19:15Recent activity 2026-06-04 19:21Estimated read 8 min
GenMLX: Build an LLM Inference Cluster with Multiple Apple Silicon Macs
1

Section 01

GenMLX: An Open-Source Solution for Building LLM Inference Clusters with Multiple Apple Silicon Macs

GenMLX is an open-source project maintained by crystech. It allows users to connect multiple Apple Silicon Macs (M-series chips) via Thunderbolt 5 network to form a tensor-parallel inference cluster for running large language models locally. The project was released on June 4, 2026, and its source code is hosted on GitHub (link: https://github.com/crystech/GenMLX). Its core goal is to help users with multiple Macs make full use of existing hardware resources to run large models locally that cannot fit in the memory of a single device.

2

Section 02

Background: Demand for Local LLM Clusters and GenMLX's Design Intent

As the parameter scale of large language models grows, the memory and computing power of a single device are often insufficient to meet the demand. For users with multiple Apple Silicon Macs, how to use existing hardware to run larger models locally has become an urgent problem to solve. GenMLX is designed for this scenario, based on Apple's MLX framework, using the low-latency network characteristics of Thunderbolt 5 to form a unified inference cluster with multiple Macs.

3

Section 03

Core Architecture: Analysis of the Three-Tier Structure

GenMLX adopts a three-tier architecture:

  1. Master Node: Acts as the cluster coordinator, responsible for hosting the Web UI and REST API, managing the SQLite agent registry, running the grid planner, tracking job status, and running rank 0 of the dispatcher.

  2. Agent Node: A lightweight HTTP daemon runs on each working Mac, responding to commands from the master node, including file synchronization, command execution, rank startup, and grid configuration.

  3. Dispatcher: The inference core (3000+ lines of FastAPI application) encapsulates mlx-lm, responsible for continuous batching, L2 cache management, thought token and tool call parsing, and provides OpenAI/Anthropic-compatible APIs.

4

Section 04

Key Technical Features: Parallel Strategies and Performance Optimization

GenMLX's key technical features include:

  • Tensor Parallelism and Pipeline Parallelism: Supports heterogeneous memory configurations. Homogeneous device clusters automatically select tensor parallelism, while heterogeneous clusters select pipeline parallelism, no manual sharding required.

  • L2 Disk Cache: Implements 200GB+ SSD KV cache, reducing cold start pre-filling time from 88 minutes to 37 seconds. Conversations sharing system prompts can reuse the cache.

  • Network Topology Support: Supports Thunderbolt5 RDMA (best performance), Thunderbolt4/3 RDMA, 10GbE Ethernet, and 1GbE Ethernet (performance degradation). The grid setup wizard can recommend the optimal configuration.

5

Section 05

Performance Requirements and Tool Compatibility

Performance and Resource Requirements

Component Minimum Configuration Recommended Configuration
Number of Macs 1 M-series 2-6 M-series
Per-Mac Memory 32GB 96GB/192GB/512GB
Per-Mac Storage 50GB available 500GB+ (models + cache)
macOS Version 14 Sonoma 15 Sequoia
Network (Multi-node) 1 GbE/Wi-Fi Thunderbolt5 RDMA

Compatibility and Integration

GenMLX provides OpenAI-compatible API endpoints (/v1/chat/completions, /v1/completions, /v1/models) that can be directly integrated with tools like Claude Code, Cline, opencode, and OpenWebUI without modifying client code. It also natively supports the Anthropic API adapter, allowing direct access to Claude Code.

6

Section 06

Use Cases and Value

GenMLX is suitable for the following scenarios:

  1. Privacy-First Local Inference: No API key required, no rate limits, data never leaves the private network.

  2. Maximize Existing Hardware: Combine multiple Macs to run large models that cannot fit on a single device (e.g., 100B+ parameter models like DeepSeek V4, Qwen3-Coder-Next).

  3. Fast First Token Time: Disk cache significantly reduces first token response time in long-context scenarios.

  4. Development and Testing Environment: Provides a local, controllable model service environment for AI application development.

7

Section 07

Comparison with Similar Projects and Summary

Differences from Similar Projects

Compared to projects like EXO Labs, GenMLX has a more focused positioning:

  • Fixed Topology vs Dynamic Discovery: Assumes a fixed private device cluster instead of cross-device dynamic discovery.
  • Apple Silicon Exclusive: Deeply optimized for the MLX framework and unified memory architecture.
  • Simplified Deployment: One-click installation via curl | bash, completing from installation to generating the first token within 15 minutes.

Summary

GenMLX represents a new idea for local AI infrastructure: building a simple and reliable inference cluster within a private network using手边 hardware. For developers or teams with multiple Apple Silicon Macs, it is a solution worth trying. The project is currently in the pre-alpha stage (v0.1.0.dev0) and is moving towards v1.0.0.