# GenMLX: Build an LLM Inference Cluster with Multiple Apple Silicon Macs

> GenMLX is an open-source project that allows users to connect multiple Apple Silicon Macs (M-series chips) via Thunderbolt 5 network to form a tensor-parallel inference cluster for running large language models locally.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-04T11:15:41.000Z
- 最近活动: 2026-06-04T11:21:53.236Z
- 热度: 150.9
- 关键词: Apple Silicon, MLX, 分布式推理, LLM, Thunderbolt, 本地部署, 张量并行, 集群
- 页面链接: https://www.zingnex.cn/en/forum/thread/genmlx-apple-silicon-mac-llm
- Canonical: https://www.zingnex.cn/forum/thread/genmlx-apple-silicon-mac-llm
- Markdown 来源: floors_fallback

---

## GenMLX: An Open-Source Solution for Building LLM Inference Clusters with Multiple Apple Silicon Macs

GenMLX is an open-source project maintained by crystech. It allows users to connect multiple Apple Silicon Macs (M-series chips) via Thunderbolt 5 network to form a tensor-parallel inference cluster for running large language models locally. The project was released on June 4, 2026, and its source code is hosted on GitHub (link: https://github.com/crystech/GenMLX). Its core goal is to help users with multiple Macs make full use of existing hardware resources to run large models locally that cannot fit in the memory of a single device.

## Background: Demand for Local LLM Clusters and GenMLX's Design Intent

As the parameter scale of large language models grows, the memory and computing power of a single device are often insufficient to meet the demand. For users with multiple Apple Silicon Macs, how to use existing hardware to run larger models locally has become an urgent problem to solve. GenMLX is designed for this scenario, based on Apple's MLX framework, using the low-latency network characteristics of Thunderbolt 5 to form a unified inference cluster with multiple Macs.

## Core Architecture: Analysis of the Three-Tier Structure

GenMLX adopts a three-tier architecture:

1. **Master Node**: Acts as the cluster coordinator, responsible for hosting the Web UI and REST API, managing the SQLite agent registry, running the grid planner, tracking job status, and running rank 0 of the dispatcher.

2. **Agent Node**: A lightweight HTTP daemon runs on each working Mac, responding to commands from the master node, including file synchronization, command execution, rank startup, and grid configuration.

3. **Dispatcher**: The inference core (3000+ lines of FastAPI application) encapsulates mlx-lm, responsible for continuous batching, L2 cache management, thought token and tool call parsing, and provides OpenAI/Anthropic-compatible APIs.

## Key Technical Features: Parallel Strategies and Performance Optimization

GenMLX's key technical features include:

- **Tensor Parallelism and Pipeline Parallelism**: Supports heterogeneous memory configurations. Homogeneous device clusters automatically select tensor parallelism, while heterogeneous clusters select pipeline parallelism, no manual sharding required.

- **L2 Disk Cache**: Implements 200GB+ SSD KV cache, reducing cold start pre-filling time from 88 minutes to 37 seconds. Conversations sharing system prompts can reuse the cache.

- **Network Topology Support**: Supports Thunderbolt5 RDMA (best performance), Thunderbolt4/3 RDMA, 10GbE Ethernet, and 1GbE Ethernet (performance degradation). The grid setup wizard can recommend the optimal configuration.

## Performance Requirements and Tool Compatibility

### Performance and Resource Requirements
| Component | Minimum Configuration | Recommended Configuration |
|-----------|-----------------------|---------------------------|
| Number of Macs | 1 M-series | 2-6 M-series |
| Per-Mac Memory | 32GB | 96GB/192GB/512GB |
| Per-Mac Storage | 50GB available | 500GB+ (models + cache) |
| macOS Version | 14 Sonoma |15 Sequoia |
| Network (Multi-node) |1 GbE/Wi-Fi | Thunderbolt5 RDMA |

### Compatibility and Integration
GenMLX provides OpenAI-compatible API endpoints (/v1/chat/completions, /v1/completions, /v1/models) that can be directly integrated with tools like Claude Code, Cline, opencode, and OpenWebUI without modifying client code. It also natively supports the Anthropic API adapter, allowing direct access to Claude Code.

## Use Cases and Value

GenMLX is suitable for the following scenarios:

1. **Privacy-First Local Inference**: No API key required, no rate limits, data never leaves the private network.

2. **Maximize Existing Hardware**: Combine multiple Macs to run large models that cannot fit on a single device (e.g., 100B+ parameter models like DeepSeek V4, Qwen3-Coder-Next).

3. **Fast First Token Time**: Disk cache significantly reduces first token response time in long-context scenarios.

4. **Development and Testing Environment**: Provides a local, controllable model service environment for AI application development.

## Comparison with Similar Projects and Summary

### Differences from Similar Projects
Compared to projects like EXO Labs, GenMLX has a more focused positioning:
- **Fixed Topology vs Dynamic Discovery**: Assumes a fixed private device cluster instead of cross-device dynamic discovery.
- **Apple Silicon Exclusive**: Deeply optimized for the MLX framework and unified memory architecture.
- **Simplified Deployment**: One-click installation via curl | bash, completing from installation to generating the first token within 15 minutes.

### Summary
GenMLX represents a new idea for local AI infrastructure: building a simple and reliable inference cluster within a private network using手边 hardware. For developers or teams with multiple Apple Silicon Macs, it is a solution worth trying. The project is currently in the pre-alpha stage (v0.1.0.dev0) and is moving towards v1.0.0.
