Zing Forum

EdgeVisor: Connecting Home Devices into a Distributed LLM Inference Cluster

EdgeVisor is an experimental project for distributed large language model (LLM) inference on CPUs and GPUs, built as an extension of Distributed Llama. It connects multiple home devices into a single inference cluster and supports non-uniform static tensor parallelism and dynamic migration, so that ordinary users can build high-performance AI inference infrastructure from edge devices.

Tags: Distributed inference, Edge computing, LLM, Vulkan, Tensor parallelism, Dynamic migration, GitHub
Published 2026-06-09 14:12 | Recent activity 2026-06-09 14:21 | Estimated read 7 min

Section 01

EdgeVisor: Connecting Home Devices into Distributed LLM Inference Clusters

EdgeVisor is an experimental extension of Distributed Llama for distributed LLM inference across CPUs and GPUs. It connects multiple home devices into one inference cluster and supports non-uniform static tensor parallelism and dynamic migration, so ordinary users can assemble high-performance AI inference infrastructure from edge devices. Key features include hybrid CPU/GPU support, a Vulkan GPU backend, and six acceptance regression tests. Source: GitHub repository by wansongcc (https://github.com/wansongcc/EdgeVisor, updated 2026-06-09).

Section 02

Project Background & Motivation

As LLMs grow in size, single-device inference runs into memory and compute bottlenecks. Many households have idle devices (laptops, desktops, Raspberry Pi boards, GPU workstations). EdgeVisor aims to connect these scattered edge devices into a unified cluster so that adding devices speeds up LLM inference. It extends Distributed Llama, retaining single-machine capability while adding hybrid CPU/GPU support, non-uniform static tensor parallelism, and UDS-based dynamic migration, so users can run on consumer hardware instead of expensive servers.

Section 03

Core Architecture & Technical Features

Non-uniform Static Tensor Parallelism (Non-uniform Static TP)

Traditional tensor parallelism (TP) assumes every device has the same capability, which does not hold for heterogeneous edge devices. EdgeVisor lets users configure per-device load ratios (e.g., --ratios "2:3:3" for three devices) so that each device's share of the work matches its computing power.
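
To make the ratio idea concrete, here is a minimal Python sketch of how a ratio string could be turned into per-device shares of a divisible resource such as attention heads. It is an illustration only: the function names are hypothetical, the largest-remainder rounding is an assumption, and the actual engine may apply extra alignment constraints that this sketch ignores.

```python
def parse_ratios(spec: str) -> list[int]:
    """Parse a ratio string such as "2:3:3" into positive integers."""
    parts = [int(p) for p in spec.split(":")]
    if not parts or any(p <= 0 for p in parts):
        raise ValueError(f"ratios must be positive integers: {spec!r}")
    return parts


def split_by_ratio(total: int, ratios: list[int]) -> list[int]:
    """Split `total` units (e.g. attention heads) proportionally to `ratios`.

    Uses the largest-remainder method so the shares always sum to `total`.
    This rounding rule is an assumption, not EdgeVisor's documented behavior.
    """
    weight = sum(ratios)
    shares = [total * r // weight for r in ratios]
    remainders = [total * r % weight for r in ratios]
    leftover = total - sum(shares)
    # Hand the remaining units to the devices with the largest remainders.
    order = sorted(range(len(ratios)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


heads = split_by_ratio(32, parse_ratios("2:3:3"))
print(heads)  # [8, 12, 12]
```

With 32 heads and "2:3:3", the weakest device gets a quarter of the heads and the other two get three eighths each.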

UDS-controlled Dynamic Migration

EdgeVisor supports adjusting the distribution of attention heads and FFN layers at runtime over UDS. With plan-uds-client.py, users can migrate:

  • --kind1: Only attention heads
  • --kind2: Only FFN layers
  • --kind3: Both heads and FFN

EdgeVisor also supports migrating whole transformer layers at the pipeline-parallelism (PP) level for load balancing and fault recovery.
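
As a rough illustration of what a migration has to decide, the Python sketch below compares an old and a new partition and lists which units (for example attention heads) change owner. It does not reproduce the command format of plan-uds-client.py, which is not described here; the function names and the assumption that each device owns a contiguous range of units are hypothetical.

```python
def owners(counts: list[int]) -> list[int]:
    """Map each unit index to its owning device, given contiguous per-device counts."""
    result = []
    for device, n in enumerate(counts):
        result.extend([device] * n)
    return result


def plan_moves(old_counts: list[int], new_counts: list[int]) -> list[tuple[int, int, int]]:
    """Return (unit_index, src_device, dst_device) for every unit that changes owner."""
    if sum(old_counts) != sum(new_counts):
        raise ValueError("old and new partitions must cover the same number of units")
    old, new = owners(old_counts), owners(new_counts)
    return [(i, s, d) for i, (s, d) in enumerate(zip(old, new)) if s != d]


# Hypothetical example: 32 attention heads move from an 8/12/12 split to 12/10/10.
moves = plan_moves([8, 12, 12], [12, 10, 10])
print(len(moves))  # 6
print(moves[0])    # (8, 1, 0)
```

Only six of the 32 heads need to move in this example, which is why an incremental migration can be cheaper than re-sharding the whole model.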

Vulkan GPU Backend

EdgeVisor implements a Vulkan-based GPU backend for q80/q40 quantized matrix multiplication, which offers broader cross-device compatibility than CUDA. The backend handles changes in input width after online re-partitioning. CPU inference is optimized for the Llama3 chat template (using [REMOVED_SPECIAL_TOKEN] as the end marker).
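
For readers unfamiliar with block quantization, the following Python sketch shows the general idea behind an 8-bit block format. It assumes q80 follows the widely used layout of 32-value blocks with one scale per block; that layout, the function names, and the use of a plain Python float for the scale are assumptions for illustration. The real kernels run as Vulkan shaders and are not shown here.

```python
BLOCK_SIZE = 32  # assumed block size, as in the common 8-bit block layout


def quantize_q80(values: list[float]) -> list[tuple[float, list[int]]]:
    """Quantize floats into blocks of (scale, int8 values in [-127, 127])."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        max_abs = max(abs(v) for v in chunk)
        scale = max_abs / 127.0
        if scale == 0.0:
            quants = [0] * BLOCK_SIZE  # all-zero block, avoid dividing by zero
        else:
            quants = [round(v / scale) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_q80(blocks: list[tuple[float, list[int]]]) -> list[float]:
    """Reconstruct approximate floats from (scale, quants) blocks."""
    return [scale * q for scale, quants in blocks for q in quants]


values = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
blocks = quantize_q80(values)
restored = dequantize_q80(blocks)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(max_error <= blocks[0][0] / 2 + 1e-9)  # True
```

The reconstruction error per value is bounded by half the block scale, which is the trade-off such formats make for smaller weights and cheaper matrix multiplication.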

Section 04

Engineering Structure & Acceptance Tests

Directory Structure

  • EdgeVisor/: Core C++/Vulkan inference engine
  • config/env.sh: Unified environment variables
  • scripts/semantic/: CPU/GPU semantic regression & distributed scripts
  • scripts/gpu/: GPU PP, patch regression & debug scripts
  • tests/semantic/: Six benchmark regression tests
  • docs/test_records/: Test records
  • maintenance/: Historical patches & debug configs
  • artifacts/: Historical logs & experiment results

Six Acceptance Tests

  1. CPU single-machine test
  2. GPU single-machine test
  3. CPU non-uniform static test
  4. GPU non-uniform static test
  5. CPU non-uniform dynamic migration test
  6. GPU non-uniform dynamic migration test

Run all via run_six_benchmark_tests.sh to get input, UDS commands, token output, speed, and performance profiling.

Section 05

Use Cases & Significance

EdgeVisor applies to:

  • Home AI Lab: AI enthusiasts can build local clusters to run open-source LLMs without cloud services or expensive hardware.
  • Edge Computing Prototype: Researchers can verify the feasibility of distributed edge inference and evaluate scheduling strategies.
  • Education: Students can learn distributed systems, GPU programming, and LLM inference engineering.
  • Low-resource Environments: Local distributed inference is an alternative to cloud in network-limited or data-sensitive scenarios.

Its core value: Democratizing AI infrastructure for ordinary users.

Section 06

Limitations & Future Directions

Current Limitations

  • Network Dependency: Distributed inference is sensitive to bandwidth and latency; performance over WiFi may lag behind wired networks.
  • Quantization Support: Mainly q40/q80; full-precision inference needs adaptation.
  • Model Compatibility: Because it is based on Distributed Llama, it supports the Llama series; other models need porting.
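
The network sensitivity noted above can be illustrated with a back-of-envelope model in Python. This is a simplified sketch, not a measurement of EdgeVisor: the function name, the number of synchronizations per layer, the payload size, and the link figures are all illustrative assumptions.

```python
def sync_time_per_token(n_layers: int, syncs_per_layer: int, bytes_per_sync: int,
                        bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Estimate seconds spent on inter-device synchronization for one token."""
    per_sync = latency_s + bytes_per_sync / bandwidth_bytes_per_s
    return n_layers * syncs_per_layer * per_sync


# All numbers below are illustrative assumptions, not measurements.
MODEL = dict(n_layers=32, syncs_per_layer=2, bytes_per_sync=8192)
wired = sync_time_per_token(**MODEL, bandwidth_bytes_per_s=125e6, latency_s=0.0002)
wifi = sync_time_per_token(**MODEL, bandwidth_bytes_per_s=25e6, latency_s=0.003)
print(f"wired: {wired * 1000:.1f} ms/token, wifi: {wifi * 1000:.1f} ms/token")
```

Under these assumed figures the per-message latency term dominates the payload transfer time, so a higher-latency link costs far more per token than its lower bandwidth alone would suggest.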

Future Plans

  • Smarter dynamic load balancing algorithms
  • Network-aware adaptive sharding strategies
  • Better fault tolerance and recovery mechanisms

Section 07

Conclusion

EdgeVisor explores aggregating consumer devices into a unified computing pool for edge AI inference. Although the project is not yet mature enough for production use, its idea of letting ordinary users build distributed AI infrastructure is significant for democratizing AI. It is a valuable starting point for developers who want to understand distributed LLM inference or try cluster inference on a budget.