EdgeVisor: Connecting Home Devices into a Distributed LLM Inference Cluster

EdgeVisor is an experimental project for distributed large language model (LLM) inference on CPUs and GPUs, built as an extension of Distributed Llama. It connects multiple home devices into an inference cluster and supports non-uniform static tensor parallelism and dynamic migration, so ordinary users can build high-performance AI inference infrastructure from edge devices.

Tags: Distributed Inference, Edge Computing, LLM, Vulkan, Tensor Parallelism, Dynamic Migration, GitHub

Published 2026-06-09 14:12 | Recent activity 2026-06-09 14:21 | Estimated read 7 min

Section 01

EdgeVisor: Connecting Home Devices into Distributed LLM Inference Clusters

EdgeVisor is an experimental project that extends Distributed Llama for distributed LLM inference on CPUs and GPUs. It connects multiple home devices into an inference cluster and supports non-uniform static tensor parallelism and dynamic migration, which lets ordinary users build high-performance AI inference infrastructure from edge devices. Key features include CPU/GPU hybrid support, a Vulkan GPU backend, and six acceptance regression tests. Source: GitHub repo by wansongcc (https://github.com/wansongcc/EdgeVisor, updated 2026-06-09).

Section 02

Project Background & Motivation

As LLMs grow, single-device inference runs into memory and compute bottlenecks. Many households have idle devices (laptops, desktops, Raspberry Pi boards, GPU workstations). EdgeVisor aims to connect these scattered edge devices into a unified cluster so that adding devices speeds up LLM inference. It extends Distributed Llama, retaining single-machine capability while adding CPU/GPU hybrid support, non-uniform static tensor parallelism, and UDS-based dynamic migration, so users can run on consumer hardware instead of expensive servers.

Section 03

Core Architecture & Technical Features

Non-uniform Static Tensor Parallelism (Non-uniform Static TP)

Traditional TP assumes uniform device capability, which isn't true for heterogeneous edge devices. EdgeVisor allows configuring load ratios (e.g., --ratios "2:3:3" for three devices) to match each device's computing power.
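To make the ratio idea concrete, here is a minimal sketch of how a ratio string such as "2:3:3" could be turned into per-device shares of attention heads. The function name split_by_ratio and the largest-remainder rounding rule are assumptions chosen for illustration; EdgeVisor's own partitioning logic may round or align shares differently.

```python
def split_by_ratio(total, ratios):
    """Split `total` units (e.g. attention heads) across devices
    in proportion to a colon-separated ratio string like "2:3:3".

    Hypothetical helper for illustration; not EdgeVisor's actual code.
    """
    weights = [int(r) for r in ratios.split(":")]
    if any(w <= 0 for w in weights):
        raise ValueError("ratios must be positive integers")
    weight_sum = sum(weights)

    # Give every device its floor share first.
    shares = [total * w // weight_sum for w in weights]

    # Hand the leftover units to the devices with the largest
    # fractional remainder; ties go to the lower device index.
    remainders = [(total * w % weight_sum, -i) for i, w in enumerate(weights)]
    leftover = total - sum(shares)
    for _, neg_index in sorted(remainders, reverse=True)[:leftover]:
        shares[-neg_index] += 1
    return shares


print(split_by_ratio(32, "2:3:3"))  # [8, 12, 12]
print(split_by_ratio(10, "2:3:3"))  # [2, 4, 4]
```

With 32 heads the ratio divides evenly; with 10 heads the two leftover units go to the devices with the larger weights, so the total is always preserved.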

UDS-controlled Dynamic Migration

Supports real-time adjustment of heads/FFN layer distribution via UDS. Using plan-uds-client.py, users can migrate:

--kind1: Only attention heads
--kind2: Only FFN layers
--kind3: Both heads and FFN

Also supports PP-level transformer layer migration for load balancing and fault recovery.
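The effect of the three migration kinds on a cluster's allocation can be sketched as simple bookkeeping. This is only an illustration of the idea: the state layout, the function names move_units and apply_migration, and the src/dst/count parameters are all hypothetical, and the sketch does not reproduce the plan-uds-client.py interface or the messages it sends over UDS.

```python
def move_units(allocation, src, dst, count):
    """Return a new allocation with `count` units moved from device `src` to `dst`."""
    if src == dst:
        raise ValueError("source and destination must differ")
    if count < 0 or count > allocation[src]:
        raise ValueError("cannot move more units than the source holds")
    updated = list(allocation)
    updated[src] -= count
    updated[dst] += count
    return updated


def apply_migration(state, kind, src, dst, heads=0, ffn=0):
    """Apply one migration to a hypothetical cluster state.

    kind 1 moves attention heads only, kind 2 moves FFN slices only,
    kind 3 moves both, mirroring the three options described above.
    """
    if kind not in (1, 2, 3):
        raise ValueError("kind must be 1, 2, or 3")
    new_state = {name: list(alloc) for name, alloc in state.items()}
    if kind in (1, 3):
        new_state["heads"] = move_units(state["heads"], src, dst, heads)
    if kind in (2, 3):
        new_state["ffn"] = move_units(state["ffn"], src, dst, ffn)
    return new_state


# Assumed starting point: three devices holding heads and FFN slices.
state = {"heads": [8, 12, 12], "ffn": [2048, 3072, 3072]}
after = apply_migration(state, kind=3, src=1, dst=0, heads=2, ffn=512)
print(after["heads"])  # [10, 10, 12]
print(after["ffn"])    # [2560, 2560, 3072]
```

Returning a new state instead of mutating the old one keeps the previous plan available, which is convenient if a migration has to be rolled back.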

Vulkan GPU Backend

Implements Vulkan-based GPU backend for q80/q40 quantized matrix multiplication, offering better cross-device compatibility than CUDA. Handles input width changes after online re-partitioning. CPU inference is optimized with Llama3 chat template (using [REMOVED_SPECIAL_TOKEN] as end marker).
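For readers unfamiliar with block-quantized matrix multiplication, the following CPU-side sketch shows the general idea behind an 8-bit block format. It assumes q80 resembles the widely used Q8_0 layout (blocks of 32 values sharing one scale); the block size, the function names, and the rounding rule are assumptions, and EdgeVisor's actual data layout and Vulkan shaders may differ.

```python
BLOCK_SIZE = 32  # assumed block size, as in the common Q8_0 layout


def quantize_q80(values):
    """Quantize floats into a list of (scale, int8_values) blocks."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of BLOCK_SIZE")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        # One scale per block maps the largest magnitude to 127.
        scale = max(abs(v) for v in chunk) / 127.0
        if scale == 0.0:
            quants = [0] * BLOCK_SIZE
        else:
            quants = [round(v / scale) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_q80(blocks):
    """Recover approximate floats from quantized blocks."""
    return [scale * q for scale, quants in blocks for q in quants]


def dot_q80(a_blocks, b_blocks):
    """Dot product of two quantized vectors.

    Each block pair is multiplied in integers, then rescaled once,
    which is the core step a quantized matmul repeats per row.
    """
    total = 0.0
    for (scale_a, qa), (scale_b, qb) in zip(a_blocks, b_blocks):
        int_sum = sum(x * y for x, y in zip(qa, qb))
        total += scale_a * scale_b * int_sum
    return total


# Small deterministic inputs: two blocks of activations and weights.
x = [(i - 16) / 8.0 for i in range(64)]
w = [((i * 7) % 13 - 6) / 10.0 for i in range(64)]
exact = sum(a * b for a, b in zip(x, w))
approx = dot_q80(quantize_q80(x), quantize_q80(w))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because the scale is stored per block, changing how many input columns a device holds after re-partitioning changes how many blocks it must process, which is one reason a backend has to handle input width changes explicitly.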

Section 04

Engineering Structure & Acceptance Tests

Directory Structure

EdgeVisor/: Core C++/Vulkan inference engine
config/env.sh: Unified environment variables
scripts/semantic/: CPU/GPU semantic regression & distributed scripts
scripts/gpu/: GPU PP, patch regression & debug scripts
tests/semantic/: Six benchmark regression tests
docs/test_records/: Test records
maintenance/: Historical patches & debug configs
artifacts/: Historical logs & experiment results

Six Acceptance Tests

CPU single-machine test
GPU single-machine test
CPU non-uniform static test
GPU non-uniform static test
CPU non-uniform dynamic migration test
GPU non-uniform dynamic migration test

Run all six via run_six_benchmark_tests.sh to get the input, UDS commands, token output, speed, and performance profiling.

Section 05

Use Cases & Significance

EdgeVisor applies to:

Home AI Lab: AI enthusiasts can build local clusters to run open-source LLMs without cloud services or expensive hardware.
Edge Computing Prototype: Researchers verify distributed edge inference feasibility and scheduling strategies.
Education: Students learn distributed systems, GPU programming, and LLM inference engineering.
Low-resource Environments: Local distributed inference is an alternative to cloud in network-limited or data-sensitive scenarios.

Its core value: Democratizing AI infrastructure for ordinary users.

Section 06

Limitations & Future Directions

Current Limitations

Network Dependency: Distributed inference is sensitive to bandwidth/latency; WiFi performance may lag wired networks.
Quantization Support: Mainly q40/q80; full-precision needs adaptation.
Model Compatibility: Based on Distributed Llama, supports Llama series; other models need porting.

Future Plans

Smarter dynamic load balancing algorithms
Network-aware adaptive sharding strategies
Better fault tolerance and recovery mechanisms

Section 07

Conclusion

EdgeVisor explores aggregating consumer devices into a unified computing pool for edge AI inference. It is not yet production-ready and still needs to mature, but its goal of letting ordinary users build distributed AI infrastructure helps democratize access to AI. It is a useful starting point for developers who want to understand distributed LLM inference or try cluster inference on a budget.
