EdgeVisor: Connecting Home Devices into a Distributed LLM Inference Cluster

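EdgeVisor is an experimental CPU/GPU distributed LLM inference project that extends Distributed Llama. It connects multiple home devices into an inference cluster and implements non-uniform static tensor parallelism and dynamic migration, so that ordinary users can build high-performance AI inference infrastructure from edge devices.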

Tags: distributed inference, edge computing, LLM, Vulkan, tensor parallelism, dynamic migration, GitHub

Published 2026/06/09 14:12 · Last activity 2026/06/09 14:21 · Estimated reading time: 7 minutes

Section 01

EdgeVisor: Connecting Home Devices into Distributed LLM Inference Clusters

EdgeVisor is an experimental project that extends Distributed Llama for distributed LLM inference on CPUs and GPUs. It connects multiple home devices into an inference cluster and supports non-uniform static tensor parallelism and dynamic migration, so that ordinary users can build high-performance AI inference infrastructure from edge devices. Key features include hybrid CPU/GPU support, a Vulkan GPU backend, and six acceptance regression tests. Source: GitHub repository by wansongcc (https://github.com/wansongcc/EdgeVisor, updated 2026-06-09).

Section 02

Project Background & Motivation

As LLMs grow in scale, single-device inference runs into memory and compute bottlenecks. Meanwhile, many households have idle devices (laptops, desktops, Raspberry Pi boards, GPU workstations). EdgeVisor aims to connect these scattered edge devices into a unified cluster to accelerate LLM inference, with the goal that adding devices increases inference speed. It extends Distributed Llama, retaining single-machine capability while adding hybrid CPU/GPU support, non-uniform static tensor parallelism, and UDS-based dynamic migration, so users can run on consumer hardware instead of expensive servers.

Section 03

Core Architecture & Technical Features

Non-uniform Static Tensor Parallelism (Non-uniform Static TP)

Traditional TP assumes that all devices have the same capability, which does not hold for heterogeneous edge devices. EdgeVisor lets users configure load ratios (e.g., --ratios "2:3:3" for three devices) so that each device's share of the work matches its compute capacity.
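To make the ratio idea concrete, the following is a minimal sketch of how a ratio string such as "2:3:3" could be turned into integer per-device shares of, say, attention heads. The function names and the largest-remainder rounding are assumptions made for illustration; the article does not describe how EdgeVisor rounds or aligns its shards, and a real engine would likely add constraints (for example, KV-head grouping or quantization block alignment) that this sketch ignores.

```python
def parse_ratios(spec: str) -> list[int]:
    """Parse a ratio string such as "2:3:3" into positive integers."""
    parts = [int(p) for p in spec.split(":")]
    if any(p <= 0 for p in parts):
        raise ValueError(f"invalid ratio spec: {spec!r}")
    return parts


def split_by_ratio(total: int, ratios: list[int]) -> list[int]:
    """Split `total` units across devices in proportion to `ratios`.

    Hypothetical helper: uses the largest-remainder method so the
    integer shares always sum exactly to `total`.
    """
    weight = sum(ratios)
    shares = [total * r // weight for r in ratios]
    remainders = [total * r % weight for r in ratios]
    leftover = total - sum(shares)
    # Give the leftover units to the devices with the largest remainders.
    order = sorted(range(len(ratios)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


ratios = parse_ratios("2:3:3")
head_shares = split_by_ratio(32, ratios)  # 32 heads is an illustrative count
print(head_shares)
```

With 32 heads the split is exact; when the total does not divide evenly, the leftover units go to the devices whose proportional share was rounded down the most.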

UDS-controlled Dynamic Migration

The distribution of attention heads and FFN layers across devices can be adjusted at runtime via UDS. Using plan-uds-client.py, users can migrate:

--kind1: Only attention heads
--kind2: Only FFN layers
--kind3: Both heads and FFN

PP-level (pipeline-parallel) transformer layer migration is also supported, for load balancing and fault recovery.
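The three migration kinds can be illustrated with a small bookkeeping sketch. It models only the accounting (which resource moves between which nodes), not the UDS protocol or the actual transfer of weights. The data layout, the `migrate` function, and the node names are hypothetical and are not taken from plan-uds-client.py.

```python
from copy import deepcopy

# Hypothetical mapping from migration kind to the resources it is allowed to move.
KIND_RESOURCES = {1: ("heads",), 2: ("ffn",), 3: ("heads", "ffn")}


def migrate(assignment, kind, src, dst, amount):
    """Return a new assignment with units moved from `src` to `dst`.

    `assignment` maps a node name to {"heads": int, "ffn": int}.
    `amount` maps a resource name to the number of units to move;
    resources not covered by `kind` are left untouched.
    """
    if kind not in KIND_RESOURCES:
        raise ValueError(f"unknown kind: {kind}")
    new = deepcopy(assignment)
    for res in KIND_RESOURCES[kind]:
        n = amount.get(res, 0)
        if n < 0 or n > new[src][res]:
            raise ValueError(f"cannot move {n} {res} from {src}")
        new[src][res] -= n
        new[dst][res] += n
    return new


# Illustrative starting point: a 2:3:3 split of 32 heads and 14336 FFN columns.
before = {
    "node0": {"heads": 8, "ffn": 3584},
    "node1": {"heads": 12, "ffn": 5376},
    "node2": {"heads": 12, "ffn": 5376},
}
# Kind 1 moves only attention heads, so the "ffn" entry in `amount` is ignored.
after = migrate(before, kind=1, src="node1", dst="node0", amount={"heads": 2, "ffn": 512})
print(after)
```

The total number of heads and FFN columns is conserved by construction, which is the invariant any real migration plan would also need to preserve.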

Vulkan GPU Backend

A Vulkan-based GPU backend implements q80/q40 quantized matrix multiplication, offering broader cross-device compatibility than CUDA, which runs only on NVIDIA GPUs. The backend also handles input width changes after online re-partitioning. CPU inference is optimized for use with the Llama3 chat template (using [REMOVED_SPECIAL_TOKEN] as the end marker).
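For readers unfamiliar with block quantization, here is a simplified CPU-side sketch of the kind of arithmetic a q80 matrix multiplication kernel performs. It assumes the common 8-bit block layout of 32 values sharing one scale. The article does not spell out the exact layout EdgeVisor uses, and for brevity the sketch quantizes both operands to 8 bits rather than mixing q40 weights with q80 activations.

```python
BLOCK = 32  # assumed block size: 32 values share one scale


def quantize_q80(values):
    """Quantize a list of floats into (scale, int8-range quants) blocks."""
    if len(values) % BLOCK != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0
        inv = 1.0 / scale if scale else 0.0  # all-zero block: quants stay 0
        quants = [round(v * inv) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dot_q80(a_blocks, b_blocks):
    """Dot product of two quantized vectors, accumulated block by block.

    Inside a block the multiply-adds are done on integers; the two
    floating-point scales are applied once per block.
    """
    total = 0.0
    for (sa, qa), (sb, qb) in zip(a_blocks, b_blocks):
        int_sum = sum(x * y for x, y in zip(qa, qb))
        total += sa * sb * int_sum
    return total


weights = [(i % 7 - 3) / 10 for i in range(64)]
activations = [(i % 5 - 2) / 4 for i in range(64)]
exact = sum(w * a for w, a in zip(weights, activations))
approx = dot_q80(quantize_q80(weights), quantize_q80(activations))
print(exact, approx)
```

A matrix-vector product repeats this dot product once per output row, which is why a re-partitioning that changes the input width must keep each shard's width a multiple of the block size under this layout.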

Section 04

Engineering Structure & Acceptance Tests

Directory Structure

EdgeVisor/: Core C++/Vulkan inference engine
config/env.sh: Unified environment variables
scripts/semantic/: CPU/GPU semantic regression & distributed scripts
scripts/gpu/: GPU PP, patch regression & debug scripts
tests/semantic/: Six benchmark regression tests
docs/test_records/: Test records
maintenance/: Historical patches & debug configs
artifacts/: Historical logs & experiment results

Six Acceptance Tests

CPU single-machine test
GPU single-machine test
CPU non-uniform static test
GPU non-uniform static test
CPU non-uniform dynamic migration test
GPU non-uniform dynamic migration test

All six can be run via run_six_benchmark_tests.sh, which reports the input, UDS commands, token output, speed, and performance profiling for each test.

Section 05

Use Cases & Significance

EdgeVisor applies to:

Home AI Lab: AI enthusiasts can build local clusters to run open-source LLMs without cloud services or expensive hardware.
Edge Computing Prototype: Researchers can verify the feasibility of distributed edge inference and evaluate scheduling strategies.
Education: Students can learn distributed systems, GPU programming, and LLM inference engineering.
Low-resource Environments: Local distributed inference is an alternative to cloud in network-limited or data-sensitive scenarios.

Its core value: Democratizing AI infrastructure for ordinary users.

Section 06

Limitations & Future Directions

Current Limitations

Network Dependency: Distributed inference is sensitive to bandwidth and latency; performance over WiFi may lag behind wired networks.
Quantization Support: Mainly q40/q80; full-precision inference needs adaptation.
Model Compatibility: Because it is based on Distributed Llama, it supports the Llama series; other models need porting.
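The network dependency noted above can be reasoned about with a back-of-the-envelope model: if every synchronization step pays one round of latency plus the transfer time for its payload, then per-token network cost grows with the number of syncs. The formula and every number below are illustrative placeholders, not measurements from EdgeVisor; substitute values measured on your own network.

```python
def per_token_network_ms(syncs_per_token, bytes_per_sync, latency_ms, bandwidth_mbps):
    """Rough per-token network time: each sync pays latency plus transfer time."""
    transfer_ms = bytes_per_sync * 8 / (bandwidth_mbps * 1_000_000) * 1000
    return syncs_per_token * (latency_ms + transfer_ms)


# Placeholder inputs: 64 syncs per token, 16 KiB per sync.
wired = per_token_network_ms(64, 16_384, latency_ms=0.2, bandwidth_mbps=1000)
wifi = per_token_network_ms(64, 16_384, latency_ms=3.0, bandwidth_mbps=200)
print(f"wired: {wired:.1f} ms/token, wifi: {wifi:.1f} ms/token")
```

Because the latency term is multiplied by the number of syncs, a link with higher latency can dominate the per-token time even when its bandwidth looks adequate.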

Future Plans

Smarter dynamic load balancing algorithms
Network-aware adaptive sharding strategies
Better fault tolerance and recovery mechanisms

Section 07

Conclusion

EdgeVisor explores aggregating consumer devices into a unified computing pool for edge AI inference. Although it is not yet mature enough for production use, its idea of letting ordinary users build distributed AI infrastructure is a meaningful step toward democratizing AI. It is a valuable starting point for developers who want to understand distributed LLM inference or try cluster inference on a budget.