# EdgeVisor: Connecting Home Devices into a Distributed LLM Inference Cluster

> EdgeVisor is an experimental project for distributed CPU/GPU large language model (LLM) inference, built as an extension of Distributed Llama. It connects multiple home devices into an inference cluster and supports non-uniform static tensor parallelism and dynamic migration, so that ordinary users can build high-performance AI inference infrastructure from edge devices.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-09T06:12:42.000Z
- Last activity: 2026-06-09T06:21:04.328Z
- Popularity: 148.9
- Keywords: distributed inference, edge computing, LLM, Vulkan, tensor parallelism, dynamic migration, GitHub
- Page link: https://www.zingnex.cn/en/forum/thread/edgevisor-llm
- Canonical: https://www.zingnex.cn/forum/thread/edgevisor-llm

---

## EdgeVisor: Connecting Home Devices into Distributed LLM Inference Clusters

EdgeVisor is an experimental project, extended from Distributed Llama, for distributed CPU/GPU LLM inference. It connects multiple home devices into an inference cluster and supports non-uniform static tensor parallelism and dynamic migration, which lets ordinary users build AI inference infrastructure from edge devices. Key features include hybrid CPU/GPU support, a Vulkan GPU backend, and six acceptance regression tests. Source: GitHub repository by wansongcc (https://github.com/wansongcc/EdgeVisor, updated 2026-06-09).

## Project Background & Motivation

As LLMs grow in size, single-device inference runs into memory and compute bottlenecks. Many households have idle devices (laptops, desktops, Raspberry Pi boards, GPU workstations). EdgeVisor aims to connect these scattered edge devices into a unified cluster so that adding devices speeds up LLM inference. It extends Distributed Llama, retaining single-machine capability while adding hybrid CPU/GPU support, non-uniform static tensor parallelism, and UDS-based dynamic migration, so that users can rely on consumer hardware instead of expensive servers.

## Core Architecture & Technical Features

### Non-uniform Static Tensor Parallelism (Non-uniform Static TP)
Traditional tensor parallelism (TP) assumes that all devices are equally capable, which does not hold for heterogeneous edge devices. EdgeVisor lets users configure load ratios (e.g., `--ratios "2:3:3"` for three devices) so that each device's share of the work matches its computing power.
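To make the ratio idea concrete, here is a minimal Python sketch of how a ratio string such as `"2:3:3"` could be turned into integer shares of some divisible resource, for example attention heads. This is not EdgeVisor's actual partitioning code: the function name `split_by_ratios` and the largest-remainder rounding are assumptions made for illustration, and a real engine may add alignment constraints (such as quantization block sizes) that are ignored here.

```python
def split_by_ratios(total_units: int, ratios: str) -> list[int]:
    """Split total_units into integer shares proportional to a string like "2:3:3".

    Hypothetical helper for illustration; not part of EdgeVisor.
    """
    weights = [int(part) for part in ratios.split(":")]
    if any(w <= 0 for w in weights):
        raise ValueError("ratios must be positive integers separated by ':'")
    weight_sum = sum(weights)

    # Start from the floor of each device's proportional share.
    shares = [total_units * w // weight_sum for w in weights]

    # Give the leftover units to the devices with the largest remainders
    # (largest-remainder method), so the shares always sum to total_units.
    leftover = total_units - sum(shares)
    by_remainder = sorted(
        range(len(weights)),
        key=lambda i: (total_units * weights[i]) % weight_sum,
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


# 32 attention heads across three devices weighted 2:3:3.
print(split_by_ratios(32, "2:3:3"))  # [8, 12, 12]
```

With a uniform ratio such as `"1:1:1"` the same function reduces to the even split that traditional TP assumes.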

### UDS-controlled Dynamic Migration
EdgeVisor supports adjusting the distribution of attention heads and FFN layers at runtime through UDS. With `plan-uds-client.py`, users can choose what to migrate:
- `--kind1`: Only attention heads
- `--kind2`: Only FFN layers
- `--kind3`: Both heads and FFN

It also supports migrating transformer layers at the pipeline-parallel (PP) level for load balancing and fault recovery.
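Conceptually, a migration moves some units of work (heads or FFN slices) from overloaded devices to underloaded ones. The sketch below shows one simple way to derive such transfers from a current and a target allocation. It says nothing about the message format or options of `plan-uds-client.py`; the function `plan_moves` and its greedy matching are hypothetical and only illustrate the bookkeeping involved.

```python
def plan_moves(current: list[int], target: list[int]) -> list[tuple[int, int, int]]:
    """Return (source_device, destination_device, units) transfers that turn
    the current allocation into the target allocation.

    Hypothetical helper for illustration; not EdgeVisor's migration logic.
    """
    if len(current) != len(target) or sum(current) != sum(target):
        raise ValueError("allocations must cover the same devices and the same total")

    # Devices holding more than their target give units away; the others receive.
    surplus = [[i, c - t] for i, (c, t) in enumerate(zip(current, target)) if c > t]
    deficit = [[i, t - c] for i, (c, t) in enumerate(zip(current, target)) if t > c]

    moves = []
    s = d = 0
    while s < len(surplus) and d < len(deficit):
        units = min(surplus[s][1], deficit[d][1])
        moves.append((surplus[s][0], deficit[d][0], units))
        surplus[s][1] -= units
        deficit[d][1] -= units
        if surplus[s][1] == 0:
            s += 1
        if deficit[d][1] == 0:
            d += 1
    return moves


# Device 0 slows down, so four of its heads are shifted to devices 1 and 2.
print(plan_moves([8, 12, 12], [4, 14, 14]))  # [(0, 1, 2), (0, 2, 2)]
```

The same calculation applies whether the units are attention heads, FFN slices, or both; only the meaning of a "unit" changes.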

### Vulkan GPU Backend
EdgeVisor implements a Vulkan-based GPU backend for q80/q40 quantized matrix multiplication. Because Vulkan is not tied to a single GPU vendor, it offers broader cross-device compatibility than CUDA. The backend also handles changes in input width after online re-partitioning. CPU inference is optimized for the Llama 3 chat template (using `[REMOVED_SPECIAL_TOKEN]` as the end marker).
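For readers unfamiliar with block quantization, the following Python sketch illustrates the arithmetic behind an 8-bit quantized dot product, the basic operation inside a quantized matrix multiplication. It assumes q80 means blocks of 32 signed 8-bit values sharing one scale, as in comparable block formats; the exact layout in EdgeVisor is not described in the post, the scale is kept as a plain Python float rather than a half-precision value, and none of this reflects the project's Vulkan shaders.

```python
BLOCK_SIZE = 32  # assumed number of values per quantization block


def quantize_q80(values: list[float]) -> list[tuple[float, list[int]]]:
    """Quantize floats into (scale, int8 values) blocks. Illustrative only."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        max_abs = max(abs(v) for v in chunk)
        scale = max_abs / 127.0  # map the largest magnitude to +/-127
        if scale == 0.0:
            quants = [0] * BLOCK_SIZE
        else:
            quants = [round(v / scale) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dot_q80(a_blocks, b_blocks) -> float:
    """Dot product of two quantized vectors: integer sums per block, then rescale."""
    total = 0.0
    for (scale_a, quants_a), (scale_b, quants_b) in zip(a_blocks, b_blocks):
        int_sum = sum(x * y for x, y in zip(quants_a, quants_b))
        total += scale_a * scale_b * int_sum
    return total


vector = [i / 10 for i in range(-16, 16)]
exact = sum(v * v for v in vector)
approx = dot_q80(quantize_q80(vector), quantize_q80(vector))
print(exact, approx)
```

The approximate result stays close to the exact one while the inner loop works on small integers, which is what makes such formats attractive on memory-limited devices.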

## Engineering Structure & Acceptance Tests

### Directory Structure
- `EdgeVisor/`: Core C++/Vulkan inference engine
- `config/env.sh`: Unified environment variables
- `scripts/semantic/`: CPU/GPU semantic regression & distributed scripts
- `scripts/gpu/`: GPU PP, patch regression & debug scripts
- `tests/semantic/`: Six benchmark regression tests
- `docs/test_records/`: Test records
- `maintenance/`: Historical patches & debug configs
- `artifacts/`: Historical logs & experiment results

### Six Acceptance Tests
1. CPU single-machine test
2. GPU single-machine test
3. CPU non-uniform static test
4. GPU non-uniform static test
5. CPU non-uniform dynamic migration test
6. GPU non-uniform dynamic migration test

Run all six via `run_six_benchmark_tests.sh`; the output includes the input, UDS commands, token output, speed, and performance profiling.

## Use Cases & Significance

EdgeVisor is suited to the following scenarios:
- **Home AI Lab**: AI enthusiasts can build local clusters to run open-source LLMs without cloud services or expensive hardware.
- **Edge Computing Prototype**: Researchers can verify the feasibility of distributed edge inference and evaluate scheduling strategies.
- **Education**: Students can learn distributed systems, GPU programming, and LLM inference engineering.
- **Low-resource Environments**: Local distributed inference is an alternative to the cloud in network-limited or data-sensitive scenarios.

Its core value is democratizing AI infrastructure for ordinary users.

## Limitations & Future Directions

### Current Limitations
- **Network Dependency**: Distributed inference is sensitive to bandwidth and latency; performance over Wi-Fi may lag behind wired networks.
- **Quantization Support**: Mainly q40/q80 are supported; full-precision inference needs further adaptation.
- **Model Compatibility**: Because it is based on Distributed Llama, it supports the Llama series; other models need porting.
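
The network sensitivity noted above can be reasoned about with a simple latency-plus-bandwidth cost model. The sketch below is a back-of-the-envelope estimate, not a measurement of EdgeVisor: the function `sync_time_ms`, the payload size, the number of synchronization rounds, and the two link profiles are all made-up inputs chosen only to show how the terms combine.

```python
def sync_time_ms(payload_bytes: int, bandwidth_mbit_s: float,
                 latency_ms: float, rounds: int) -> float:
    """Estimate total synchronization time per token in milliseconds.

    Each round pays one link latency plus the time to transfer the payload.
    Hypothetical cost model for illustration only.
    """
    transfer_ms = payload_bytes * 8 / (bandwidth_mbit_s * 1_000_000) * 1000
    return rounds * (latency_ms + transfer_ms)


# Illustrative inputs (assumptions, not measured values).
payload = 16 * 1024   # bytes exchanged per synchronization round
rounds = 64           # synchronization rounds per generated token

fast_link = sync_time_ms(payload, bandwidth_mbit_s=1000.0, latency_ms=0.2, rounds=rounds)
slow_link = sync_time_ms(payload, bandwidth_mbit_s=100.0, latency_ms=3.0, rounds=rounds)
print(f"fast link: {fast_link:.1f} ms/token, slow link: {slow_link:.1f} ms/token")
```

Because latency is paid once per round, a link with higher latency can dominate the per-token cost even when the payloads are small.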

### Future Plans
- Smarter dynamic load balancing algorithms
- Network-aware adaptive sharding strategies
- Better fault tolerance and recovery mechanisms

## Conclusion

EdgeVisor explores aggregating consumer devices into a unified computing pool for edge AI inference. It is not yet production-ready and still needs to mature, but its idea of letting ordinary users build distributed AI infrastructure is significant for democratizing AI. It is a valuable starting point for developers who want to understand distributed LLM inference or try cluster inference on a budget.