Reading

GenMLX: Build an LLM Inference Cluster with Multiple Apple Silicon Macs

GenMLX is an open-source project that allows users to connect multiple Apple Silicon Macs (M-series chips) via Thunderbolt 5 network to form a tensor-parallel inference cluster for running large language models locally.

Apple SiliconMLX分布式推理LLMThunderbolt本地部署张量并行集群

Published 2026-06-04 19:15Recent activity 2026-06-04 19:21Estimated read 8 min

GenMLX: Build an LLM Inference Cluster with Multiple Apple Silicon Macs

Section 01

GenMLX: An Open-Source Solution for Building LLM Inference Clusters with Multiple Apple Silicon Macs

GenMLX is an open-source project maintained by crystech. It allows users to connect multiple Apple Silicon Macs (M-series chips) via Thunderbolt 5 network to form a tensor-parallel inference cluster for running large language models locally. The project was released on June 4, 2026, and its source code is hosted on GitHub (link: https://github.com/crystech/GenMLX). Its core goal is to help users with multiple Macs make full use of existing hardware resources to run large models locally that cannot fit in the memory of a single device.

Section 02

Background: Demand for Local LLM Clusters and GenMLX's Design Intent

As the parameter scale of large language models grows, the memory and computing power of a single device are often insufficient to meet the demand. For users with multiple Apple Silicon Macs, how to use existing hardware to run larger models locally has become an urgent problem to solve. GenMLX is designed for this scenario, based on Apple's MLX framework, using the low-latency network characteristics of Thunderbolt 5 to form a unified inference cluster with multiple Macs.

Section 03

Core Architecture: Analysis of the Three-Tier Structure

GenMLX adopts a three-tier architecture:

Master Node: Acts as the cluster coordinator, responsible for hosting the Web UI and REST API, managing the SQLite agent registry, running the grid planner, tracking job status, and running rank 0 of the dispatcher.
Agent Node: A lightweight HTTP daemon runs on each working Mac, responding to commands from the master node, including file synchronization, command execution, rank startup, and grid configuration.
Dispatcher: The inference core (3000+ lines of FastAPI application) encapsulates mlx-lm, responsible for continuous batching, L2 cache management, thought token and tool call parsing, and provides OpenAI/Anthropic-compatible APIs.

Section 04

Key Technical Features: Parallel Strategies and Performance Optimization

GenMLX's key technical features include:

Tensor Parallelism and Pipeline Parallelism: Supports heterogeneous memory configurations. Homogeneous device clusters automatically select tensor parallelism, while heterogeneous clusters select pipeline parallelism, no manual sharding required.
L2 Disk Cache: Implements 200GB+ SSD KV cache, reducing cold start pre-filling time from 88 minutes to 37 seconds. Conversations sharing system prompts can reuse the cache.
Network Topology Support: Supports Thunderbolt5 RDMA (best performance), Thunderbolt4/3 RDMA, 10GbE Ethernet, and 1GbE Ethernet (performance degradation). The grid setup wizard can recommend the optimal configuration.

Section 05

Performance Requirements and Tool Compatibility

Performance and Resource Requirements

Component	Minimum Configuration	Recommended Configuration
Number of Macs	1 M-series	2-6 M-series
Per-Mac Memory	32GB	96GB/192GB/512GB
Per-Mac Storage	50GB available	500GB+ (models + cache)
macOS Version	14 Sonoma	15 Sequoia
Network (Multi-node)	1 GbE/Wi-Fi	Thunderbolt5 RDMA

Compatibility and Integration

GenMLX provides OpenAI-compatible API endpoints (/v1/chat/completions, /v1/completions, /v1/models) that can be directly integrated with tools like Claude Code, Cline, opencode, and OpenWebUI without modifying client code. It also natively supports the Anthropic API adapter, allowing direct access to Claude Code.

Section 06

Use Cases and Value

GenMLX is suitable for the following scenarios:

Privacy-First Local Inference: No API key required, no rate limits, data never leaves the private network.
Maximize Existing Hardware: Combine multiple Macs to run large models that cannot fit on a single device (e.g., 100B+ parameter models like DeepSeek V4, Qwen3-Coder-Next).
Fast First Token Time: Disk cache significantly reduces first token response time in long-context scenarios.
Development and Testing Environment: Provides a local, controllable model service environment for AI application development.

Section 07

Comparison with Similar Projects and Summary

Differences from Similar Projects

Compared to projects like EXO Labs, GenMLX has a more focused positioning:

Fixed Topology vs Dynamic Discovery: Assumes a fixed private device cluster instead of cross-device dynamic discovery.
Apple Silicon Exclusive: Deeply optimized for the MLX framework and unified memory architecture.
Simplified Deployment: One-click installation via curl | bash, completing from installation to generating the first token within 15 minutes.

Summary

GenMLX represents a new idea for local AI infrastructure: building a simple and reliable inference cluster within a private network using手边 hardware. For developers or teams with multiple Apple Silicon Macs, it is a solution worth trying. The project is currently in the pre-alpha stage (v0.1.0.dev0) and is moving towards v1.0.0.

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

Building an AWS Generative AI Application from Scratch: EC2 + Bedrock Hands-On Tutorial

A complete cloud-native AI application development guide for beginners, building a simple generative AI chatbot using Amazon EC2, Apache, Python CGI, and Amazon Bedrock, covering architecture design, IAM permission configuration, security best practices, and cost optimization suggestions.

Recent activity 2026-06-02 19:49