Zing Forum

Thunderbolt 5 RDMA Cluster Practice: A New Distributed Large Model Inference Solution on Apple Silicon

This article introduces a distributed LLM inference cluster solution for Apple Silicon based on Thunderbolt 5 and JACCL technologies, achieving an inter-node transmission speed of up to 7.4GB/s and providing a complete toolchain and benchmark framework.

Thunderbolt 5 · RDMA · Apple Silicon · Distributed Inference · JACCL · Large Language Models · Clusters · Exo · MLX · Mac Studio
Published 2026-04-07 14:44 · Recent activity 2026-04-07 16:13 · Estimated read: 6 min

Section 01

Thunderbolt 5 RDMA Cluster Practice: Introduction to the New Distributed LLM Inference Solution on Apple Silicon

This article introduces a distributed LLM inference cluster solution for Apple Silicon based on Thunderbolt 5 and JACCL technologies, achieving an inter-node transmission speed of up to 7.4GB/s and providing a complete toolchain and benchmark framework. This solution uses consumer-grade hardware to build a high-performance AI cluster, balancing data privacy, cost-effectiveness, and flexibility.


Section 02

Background: Why Do We Need a New Distributed Inference Solution for Apple Silicon?

The parameter scale of large language models has grown to hundreds of billions, beyond what single-machine inference can serve. Each traditional solution has shortcomings: cloud APIs (privacy and latency concerns), high-end GPU servers (high cost), and multi-machine distributed systems (reliance on professional network equipment). Apple Silicon devices (Mac Studio/Mini) have become popular choices for local inference thanks to their unified memory architecture and energy efficiency, but a single device's memory is limited, so forming an efficient cluster is the key challenge.


Section 03

Technical Solution: Thunderbolt 5, JACCL, and Cluster Configuration

Thunderbolt 5: 80Gbps of bidirectional bandwidth (twice that of TB4), with support for RDMA (Remote Direct Memory Access, which lets nodes read and write each other's memory without CPU involvement, reducing latency).
JACCL: a collective communication library developed by Apple, optimized for Apple Silicon.
Cluster Configuration: three-node full mesh topology (a Mac Studio M3 Ultra as the main node, two Mac Mini M4 Pro units as worker nodes).
Network Innovation: JACCL can coexist with bridge0 without needing to destroy it; each TB interface simply gets its own independent IP.
Exo Patch: RDMA loop detection, bridge0 classification, and other patches added to the Exo framework to simplify deployment.
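The per-interface IP setup described above can be sketched as follows. The interface names (en5, en6) and the 10.0.x.x addressing are illustrative assumptions, not taken from the article; check the actual Thunderbolt interfaces on each node before applying anything.

```shell
# Sketch: give each Thunderbolt interface its own point-to-point subnet
# while leaving bridge0 intact, so JACCL can use the links directly.
# Interface names and addresses below are hypothetical examples.

# Identify which enX devices are the Thunderbolt bridges on this node.
networksetup -listallhardwareports

# Assign one independent subnet per TB link (run per node, adjusting IPs).
sudo ifconfig en5 inet 10.0.1.1 netmask 255.255.255.0   # link to worker 1
sudo ifconfig en6 inet 10.0.2.1 netmask 255.255.255.0   # link to worker 2

# bridge0 is deliberately left untouched.
```

The point of the design is that each TB cable becomes its own routed link, so the full mesh needs no switch and no changes to the default bridge.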


Section 04

Performance Testing: Transmission Speed and Task Benchmarks

Transmission Speed: Using the rdma-cp.sh and transfer.py tools, the three-node full mesh topology sustains 7.4GB/s, nearly 30 times faster than rsync over SSH (for example, transferring 250GB from Vader to Voldemort takes 88 seconds, an effective 2.84GB/s).
Task Benchmarks: Tested on agentic coding tasks (CLI tools, SSG, REST API, etc.): Qwen3-235B-A22B (8-bit) scored 100 points on the CLI tool task, while Qwen3-Coder-Next (bf16) averaged 39 points. Thinking models degrade over long runs due to KV cache pressure, so restarting the cluster between tasks is recommended.
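As a sanity check on the cited example, the effective throughput follows directly from the transfer size and wall time. The helper below is plain arithmetic for illustration, not part of the project's tooling:

```shell
# Compute average throughput in GB/s from gigabytes moved and elapsed seconds.
throughput_gbs() {
  local gigabytes=$1 seconds=$2
  awk -v g="$gigabytes" -v s="$seconds" 'BEGIN { printf "%.2f\n", g / s }'
}

# The article's example: 250GB from Vader to Voldemort in 88 seconds.
throughput_gbs 250 88   # prints 2.84
```

This confirms the 2.84GB/s figure for the real file transfer; the 7.4GB/s number is the sustained rate of the RDMA path itself.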


Section 05

Practical Toolchain: Model Transfer and Cluster Operations

Model Transfer: Use rdma-cp.sh to transfer models quickly (example: ./rdma-cp.sh ~/.exo/models/... voldemort:~/.exo/models/...).
Cluster Operations: Verify RDMA status (ibv_devinfo | grep -E 'hca_id|state:'), start the cluster (bash ~/exo-src/start-cluster.sh), and deploy models (curl POST request).
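The two operations commands above can be combined into a startup check, gating cluster launch on RDMA link health. The PORT_DOWN match is an assumption about ibv_devinfo's output format; the model-deployment step is omitted because the article does not give the exact curl endpoint or payload.

```shell
# Sketch: only start the cluster once every RDMA port reports healthy.
set -euo pipefail

# ibv_devinfo lists each HCA and its port state (e.g. PORT_ACTIVE).
# Bail out if any port is still down, which usually means a TB5 cabling issue.
if ibv_devinfo | grep -E 'hca_id|state:' | grep -q 'PORT_DOWN'; then
  echo "At least one RDMA port is down; check TB5 cabling." >&2
  exit 1
fi

bash ~/exo-src/start-cluster.sh
```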


Section 06

Known Issues and Solutions

  1. Thinking Model Performance Degradation: long inference sessions time out due to KV cache pressure → restart the cluster between tasks.
  2. MLX Memory Release: terminated MLX processes do not release memory → use SIGTERM (effective) or restart the device.
  3. Mac Studio Port Issue: avoid using the TB5 port adjacent to the Ethernet port for RDMA.
  4. Model Compatibility: Exo does not yet support model types such as gemma4 and mimo_v2_flash.
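For the MLX memory-release issue, a graceful shutdown can be scripted. The process-name pattern below is an assumption about how the MLX worker appears in the process table; adjust it to match your nodes.

```shell
# Send SIGTERM (not SIGKILL) so the MLX process can release its
# unified-memory allocations before exiting. 'mlx' is an illustrative
# match pattern, not a name taken from the article.
pkill -TERM -f 'mlx' || true
```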

Section 07

Technical Significance and Future Outlook

This project demonstrates that a high-performance AI cluster can be built from consumer-grade hardware: by leveraging TB5 RDMA and Apple Silicon's unified memory, it assembles a distributed inference environment at low cost. For researchers and developers it offers data privacy (everything runs locally), cost-effectiveness, flexibility, and high energy efficiency. As the MLX ecosystem matures and JACCL improves, more consumer-grade distributed AI solutions should emerge, bringing large model inference within reach of personal studios and small teams.