Zing Forum


EdgeVisor: Connecting Home Devices into a Distributed LLM Inference Cluster

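EdgeVisor is an experimental CPU/GPU distributed large language model inference project that extends Distributed Llama. It connects multiple home devices into one inference cluster and implements non-uniform static tensor parallelism and dynamic migration, so that ordinary users can build high-performance AI inference infrastructure from edge devices.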

Tags: distributed inference, edge computing, LLM, Vulkan, tensor parallelism, dynamic migration, GitHub
Published 2026/06/09 14:12 · Last activity 2026/06/09 14:21 · Estimated reading time: 7 minutes

Section 01

EdgeVisor: Connecting Home Devices into Distributed LLM Inference Clusters

EdgeVisor is an experimental CPU/GPU distributed LLM inference project that extends Distributed Llama. It connects multiple home devices into one inference cluster and supports non-uniform static tensor parallelism and dynamic migration, so that ordinary users can build high-performance AI inference infrastructure from edge devices. Key features include hybrid CPU/GPU support, a Vulkan GPU backend, and six acceptance regression tests. Source: GitHub repository by wansongcc (https://github.com/wansongcc/EdgeVisor, updated 2026-06-09).


Section 02

Project Background & Motivation

As LLMs grow larger, inference on a single device runs into memory and compute bottlenecks. Meanwhile, many households have idle devices such as laptops, desktops, Raspberry Pi boards, and GPU workstations. EdgeVisor aims to connect these scattered edge devices into a unified cluster, with the goal that adding devices speeds up LLM inference. It extends Distributed Llama, keeping its single-machine capability while adding hybrid CPU/GPU support, non-uniform static tensor parallelism, and UDS-based dynamic migration, so that users can run on consumer hardware instead of expensive servers.


Section 03

Core Architecture & Technical Features

Non-uniform Static Tensor Parallelism (Non-uniform Static TP)

Traditional tensor parallelism (TP) assumes that every device has the same capability, which does not hold for heterogeneous edge devices. EdgeVisor lets users configure load ratios (e.g., --ratios "2:3:3" for three devices) to match each device's computing power.
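To make the ratio idea concrete, here is a minimal Python sketch of how a ratio string such as "2:3:3" could be turned into per-device shares of a fixed number of units (for example, attention heads). The function name `split_by_ratio` and the largest-remainder rounding rule are assumptions made for illustration; the post does not describe how EdgeVisor itself interprets `--ratios`, and a real engine may add constraints such as alignment to KV-head groups.

```python
def split_by_ratio(total_units, ratios):
    """Split `total_units` across devices in proportion to a ratio string.

    Hypothetical helper for illustration only, not EdgeVisor's code.
    """
    weights = [int(part) for part in ratios.split(":")]
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("ratios must be positive integers, e.g. '2:3:3'")

    total_weight = sum(weights)
    # Start from the floor of each device's proportional share.
    shares = [total_units * w // total_weight for w in weights]

    # Hand out the leftover units to the devices with the largest
    # fractional remainder (ties go to the earlier device).
    leftover = total_units - sum(shares)
    by_remainder = sorted(
        range(len(weights)),
        key=lambda i: (total_units * weights[i]) % total_weight,
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # 32 attention heads over three devices with ratio 2:3:3
    print(split_by_ratio(32, "2:3:3"))
```

With 32 heads and the ratio 2:3:3 the split is exact (8, 12, and 12 heads); when the total does not divide evenly, the sketch still assigns every unit to exactly one device.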

UDS-controlled Dynamic Migration

EdgeVisor supports adjusting the distribution of attention heads and FFN layers at runtime over UDS. With plan-uds-client.py, users can migrate:

  • --kind1: Only attention heads
  • --kind2: Only FFN layers
  • --kind3: Both heads and FFN

It also supports migrating transformer layers at the pipeline-parallel (PP) level, for load balancing and fault recovery.
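The three migration kinds can be sketched as a small state update. The `Partition` type and `apply_migration` function below are hypothetical names used only to illustrate the semantics listed above (kind 1 touches heads, kind 2 touches FFN, kind 3 touches both); they are not the interface of plan-uds-client.py, whose message format the post does not describe.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Partition:
    """Per-device work assignment (illustrative, not EdgeVisor's data model)."""
    heads: tuple  # attention heads assigned to each device
    ffn: tuple    # FFN units assigned to each device


def apply_migration(current, target, kind):
    """Return the partition that results from migrating `current` toward `target`.

    kind 1: move attention heads only
    kind 2: move FFN units only
    kind 3: move both
    """
    # A migration redistributes work; it must not create or drop any of it.
    if sum(current.heads) != sum(target.heads) or sum(current.ffn) != sum(target.ffn):
        raise ValueError("target must keep the same totals as current")

    if kind == 1:
        return replace(current, heads=target.heads)
    if kind == 2:
        return replace(current, ffn=target.ffn)
    if kind == 3:
        return target
    raise ValueError("kind must be 1, 2, or 3")


if __name__ == "__main__":
    before = Partition(heads=(8, 12, 12), ffn=(3584, 5376, 5376))
    after = Partition(heads=(12, 10, 10), ffn=(5376, 4480, 4480))
    print(apply_migration(before, after, kind=1))
```

Separating the kinds this way means a heads-only migration leaves the FFN assignment untouched, which keeps the amount of weight data that has to move as small as the requested change allows.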

Vulkan GPU Backend

EdgeVisor implements a Vulkan-based GPU backend for q80/q40 quantized matrix multiplication; Vulkan offers better cross-device compatibility than CUDA. The backend also handles the change in input width that follows online re-partitioning. On the CPU side, inference is optimized for the Llama3 chat template (using [REMOVED_SPECIAL_TOKEN] as end marker).
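For readers unfamiliar with block-quantized formats, the following CPU-side Python sketch shows the general idea behind an 8-bit block format and a dot product computed on it. It assumes a layout of 32 values per block with one floating-point scale, modeled on the widely used Q8_0-style scheme; the post does not spell out EdgeVisor's exact q80/q40 layouts or its Vulkan shaders, so the names and constants here are illustrative assumptions.

```python
BLOCK_SIZE = 32  # assumed block length, as in common Q8_0-style formats


def quantize_block(values):
    """Quantize one block of floats to (scale, list of ints in [-127, 127])."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("expected exactly %d values" % BLOCK_SIZE)
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    scale = max_abs / 127.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


def dot_quantized(block_a, block_b):
    """Dot product of two quantized blocks.

    The products are accumulated as integers and the two scales are
    applied once at the end, which is what makes quantized matrix
    multiplication cheap.
    """
    scale_a, quants_a = block_a
    scale_b, quants_b = block_b
    acc = sum(x * y for x, y in zip(quants_a, quants_b))
    return scale_a * scale_b * acc


if __name__ == "__main__":
    xs = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
    block = quantize_block(xs)
    print(dot_quantized(block, block), sum(x * x for x in xs))
```

A matrix multiplication over such a format repeats this per-block dot product along each row, so the per-element rounding error stays bounded by half a scale step within every block.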


Section 04

Engineering Structure & Acceptance Tests

Directory Structure

  • EdgeVisor/: Core C++/Vulkan inference engine
  • config/env.sh: Unified environment variables
  • scripts/semantic/: CPU/GPU semantic regression & distributed scripts
  • scripts/gpu/: GPU PP, patch regression & debug scripts
  • tests/semantic/: Six benchmark regression tests
  • docs/test_records/: Test records
  • maintenance/: Historical patches & debug configs
  • artifacts/: Historical logs & experiment results

Six Acceptance Tests

  1. CPU single-machine test
  2. GPU single-machine test
  3. CPU non-uniform static test
  4. GPU non-uniform static test
  5. CPU non-uniform dynamic migration test
  6. GPU non-uniform dynamic migration test

Run all via run_six_benchmark_tests.sh to get input, UDS commands, token output, speed, and performance profiling.

Section 05

Use Cases & Significance

EdgeVisor applies to:

  • Home AI Lab: AI enthusiasts can build local clusters to run open-source LLMs without cloud services or expensive hardware.
  • Edge Computing Prototype: Researchers verify distributed edge inference feasibility and scheduling strategies.
  • Education: Students learn distributed systems, GPU programming, and LLM inference engineering.
  • Low-resource Environments: Local distributed inference is an alternative to cloud in network-limited or data-sensitive scenarios.

Its core value: Democratizing AI infrastructure for ordinary users.


Section 06

Limitations & Future Directions

Current Limitations

  • Network Dependency: Distributed inference is sensitive to bandwidth/latency; WiFi performance may lag wired networks.
  • Quantization Support: Mainly q40/q80; full-precision inference needs adaptation.
  • Model Compatibility: Because it builds on Distributed Llama, it supports the Llama series; other models need porting.

Future Plans

  • Smarter dynamic load balancing algorithms
  • Network-aware adaptive sharding strategies
  • Better fault tolerance and recovery mechanisms

Section 07

Conclusion

EdgeVisor explores aggregating consumer devices into a unified computing pool for edge AI inference. It is not yet mature enough for production use, but its idea of letting ordinary users build distributed AI infrastructure matters for democratizing AI. It is a valuable starting point for developers who want to understand distributed LLM inference or try cluster inference on a budget.