Zing Forum


Toolkit Inference Mesh: Building a Distributed LLM Inference Cluster on Heterogeneous Devices

AKIVA AI's open-source Toolkit Inference Mesh enables individual developers and small-to-medium teams to build a decentralized LLM inference network on heterogeneous devices (Macs, GPU servers, etc.), supporting pipeline parallel sharding and dynamic request scheduling.

Distributed Inference · LLM · Heterogeneous Computing · Apple Silicon · SGLang · MLX · Pipeline Parallelism · P2P Networking · Open-Source AI · Edge Computing
Published 2026-04-04 12:43 · Recent activity 2026-04-04 12:48 · Estimated read: 8 min

Section 02

Project Background and Core Positioning

Toolkit Inference Mesh originated as Parallax, a fully decentralized inference engine developed by the Gradient team. AKIVA AI rebranded the project and expanded its feature set to form the current Toolkit version.

Compared to the original version, Toolkit Inference Mesh places greater emphasis on compatibility with heterogeneous environments, especially support for Apple Silicon Macs, and on optimizing for individual developers and small teams.

The core goal of this project is to lower the infrastructure barrier for LLM inference. Traditionally, running large models requires expensive GPU clusters or reliance on third-party APIs, but Toolkit Inference Mesh allows users to integrate devices scattered across different locations with varying configurations into a unified inference network, enabling resource sharing and load balancing.

Section 03

Decentralized P2P Communication Layer

The underlying communication of Toolkit Inference Mesh is powered by Lattica, a peer-to-peer network library specifically designed for distributed AI workloads. Lattica handles node discovery, connection management, and data transmission, allowing each node in the network to act as both a client (submitting inference requests) and a server (providing computing power). This architecture inherently has fault tolerance and scalability—new nodes can join at any time, and faulty nodes can be automatically bypassed.
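The dual client/server role of each peer can be sketched in miniature. The `MeshNode` class below is a toy, in-process model (not the actual Lattica API; all names here are illustrative) showing how a request submitted by one node is routed around dead or saturated peers:

```python
import itertools

class MeshNode:
    """Toy model of a mesh peer: every node can both submit inference
    requests (client role) and serve them (server role)."""

    _ids = itertools.count()

    def __init__(self, registry, capacity=1):
        self.node_id = next(MeshNode._ids)
        self.registry = registry      # shared peer list, standing in for node discovery
        self.capacity = capacity      # concurrent requests this node can serve
        self.active = 0
        self.alive = True
        registry.append(self)

    def serve(self, prompt):
        """Server role: handle one request if the node is live and has capacity."""
        if not self.alive or self.active >= self.capacity:
            return None
        self.active += 1
        try:
            return f"node-{self.node_id} completed: {prompt}"
        finally:
            self.active -= 1

    def submit(self, prompt):
        """Client role: route a request to any live peer (including self);
        faulty or saturated nodes are skipped automatically."""
        for peer in self.registry:
            result = peer.serve(prompt)
            if result is not None:
                return result
        raise RuntimeError("no live node available")

registry = []
a, b = MeshNode(registry), MeshNode(registry)
a.alive = False                  # simulate a node failure
print(b.submit("hello"))         # the request is routed around the dead node
```

The real network replaces the shared list with P2P discovery and the direct method call with data transmission over Lattica, but the fault-tolerance property is the same: a request only fails when no live node remains.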

Section 04

Heterogeneous Backend Support

To support different types of hardware, the project uses a modular backend design:

  • GPU Backend: Built on SGLang, optimized for NVIDIA GPUs, supporting high-performance continuous batching and dynamic KV cache management.
  • Mac Backend: Implemented using MLX LM, Apple Silicon's native inference framework, which can fully leverage the unified memory architecture and neural engine of Mac devices.

This dual-backend design allows users to mix MacBook, Mac Studio, and NVIDIA GPU-equipped servers in the same cluster, and the system automatically selects the optimal execution path based on model sharding and current load.
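How such a dispatch might work can be sketched as follows. The backend names mirror the article (SGLang for NVIDIA GPUs, MLX LM for Apple Silicon), but the `detect_hardware`/`select_backend` helpers and the registry layout are assumptions, not the project's actual code:

```python
import platform

# Illustrative backend registry: each backend declares the hardware it targets.
BACKENDS = {
    "sglang": {"hardware": "nvidia-gpu"},    # continuous batching, KV cache management
    "mlx":    {"hardware": "apple-silicon"}, # unified memory on Apple Silicon
}

def detect_hardware(system=None, machine=None, has_cuda=False):
    """Classify a node; parameters default to the local machine."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if has_cuda:
        return "nvidia-gpu"
    if system == "Darwin" and machine == "arm64":
        return "apple-silicon"
    return "cpu-only"

def select_backend(hardware):
    for name, spec in BACKENDS.items():
        if spec["hardware"] == hardware:
            return name
    return None  # unmatched nodes could still join for lightweight roles

print(select_backend(detect_hardware(has_cuda=True)))      # sglang
print(select_backend(detect_hardware("Darwin", "arm64")))  # mlx
```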

Section 05

Pipeline Parallelism and Model Sharding

For models with parameters exceeding the memory capacity of a single machine, Toolkit Inference Mesh supports the Pipeline Parallelism model sharding strategy. Large models are horizontally split into multiple stages, each deployed on a different node, and input data flows through each stage sequentially like a pipeline. Compared to Tensor Parallelism, this approach has lower network bandwidth requirements and is more suitable for distributed scenarios where nodes are connected via ordinary internet.
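The core idea can be shown with a small sketch: layers are split into contiguous stages, and activations flow through the stages in order. This illustrates the general technique, not Toolkit Inference Mesh's actual partitioner:

```python
def shard_layers(num_layers, num_stages):
    """Split transformer layers into contiguous pipeline stages;
    earlier stages absorb the remainder layers."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages

def run_pipeline(x, stages, layer_fn):
    """Activations flow stage by stage. Between stages only a single
    activation tensor crosses the network, which is why pipeline
    parallelism needs far less bandwidth than tensor parallelism."""
    for stage in stages:
        for layer in stage:
            x = layer_fn(layer, x)
    return x

# A 32-layer model over 3 nodes: stages of 11, 11, and 10 layers.
stages = shard_layers(32, 3)
print([len(s) for s in stages])                     # [11, 11, 10]
print(run_pipeline(1, stages, lambda i, x: x + 1))  # 33: each dummy layer adds 1
```

In a real deployment each `range` of layers lives on a different node, and `layer_fn` is the forward pass of that node's shard; tensor parallelism would instead exchange partial results inside every layer, multiplying the communication volume.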

Section 06

Supported Model Ecosystem

Toolkit Inference Mesh officially supports a variety of mainstream open-source models, covering different scenarios from general dialogue to professional code generation:

| Model Series | Development Team | Features |
| --- | --- | --- |
| DeepSeek V3/R1 | DeepSeek AI | High-performance open-source large model with long-context support |
| MiniMax-M2 | MiniMax AI | 230B-parameter MoE architecture with only 10B active, efficient and cost-effective |
| GLM-4.6 | Z.ai | Agent-optimized model with a 200K context window |
| Kimi-K2 | Moonshot AI | Model family designed for deep reasoning and step-by-step thinking |
| Qwen3/Qwen2.5 | Alibaba (Tongyi Qianwen) | Excellent Chinese capabilities, multiple sizes available |
| gpt-oss | OpenAI | Open-weight models with 20B and 120B parameters |
| Llama 3.x | Meta | Well-established ecosystem with rich community support |

This extensive model support means users can flexibly choose based on specific task requirements without being locked into a single model provider's ecosystem.

Section 07

Local Cluster for Individual Developers

For developers with multiple devices, such as a desktop with a high-end GPU plus several MacBooks, Toolkit Inference Mesh provides a way to pool these resources. Developers can run the model's compute-intensive layers on the desktop while the Macs handle context management and input/output, achieving a more efficient inference experience than any single machine.
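One plausible placement heuristic, purely illustrative (the article does not describe the actual scheduler, which would also weigh load and interconnect speed), is to assign contiguous layer ranges in proportion to each device's memory:

```python
def place_layers(num_layers, device_mem_gb):
    """Assign contiguous layer ranges proportionally to each device's
    memory, so the big desktop GPU takes most of the compute-heavy
    layers while the MacBooks take smaller slices."""
    total = sum(device_mem_gb.values())
    placement, start = {}, 0
    devices = list(device_mem_gb.items())
    for i, (name, mem) in enumerate(devices):
        if i == len(devices) - 1:
            count = num_layers - start            # last device takes the rest
        else:
            count = round(num_layers * mem / total)
        placement[name] = range(start, start + count)
        start += count
    return placement

# Hypothetical home cluster: one 24 GB GPU desktop and two MacBooks.
cluster = {"desktop-4090": 24, "macbook-1": 16, "macbook-2": 8}
for device, layers in place_layers(48, cluster).items():
    print(device, len(layers))
```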

Section 08

Shared Inference Pool for Small Teams

In small research teams or startups, members may be scattered across different locations, each with hardware of varying configurations. Through Toolkit Inference Mesh, teams can build a decentralized inference pool—anyone needing to run a model can submit a request to the network, which is automatically handled by currently idle nodes. This approach is more cost-effective than equipping each person with separate high-performance devices.
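The "handled by currently idle nodes" behavior amounts to least-loaded-first dispatch, sketched below with a heap. This is a toy model of the idea, not the project's scheduler; node names and load counts are invented:

```python
import heapq

def schedule(requests, node_load):
    """Dispatch each request to the node with the fewest in-flight jobs,
    tracking the load added by the requests we assign."""
    heap = [(load, name) for name, load in node_load.items()]
    heapq.heapify(heap)
    assignment = {}
    for req in requests:
        load, name = heapq.heappop(heap)   # currently least-loaded node
        assignment[req] = name
        heapq.heappush(heap, (load + 1, name))
    return assignment

# Two idle members and one busy GPU server: new requests go to the idle Macs.
nodes = {"alice-mac": 0, "bob-gpu": 2, "carol-mac": 0}
print(schedule(["r1", "r2", "r3"], nodes))
```

A real inference pool would also refresh loads as requests complete and weigh node capability, but the cost argument is visible even here: idle hardware that members already own absorbs the work before anyone needs to buy more.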