CAIS: Compute-Aware In-Switch Computing Framework for Tensor Parallelism in Large Models

This article introduces the CAIS framework, which addresses the isolation between computation and communication in tensor parallelism on multi-GPU systems. It does so through a compute-aware ISA, merge-aware thread block coordination, and a graph-level dataflow optimizer, achieving an average 1.38x end-to-end training speedup over NVLS.

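Tags: large language models, tensor parallelism, distributed training, NVLink, multi-GPU systems, in-switch computing, computation-communication overlap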

Published 2026-05-07 11:29 | Recent activity 2026-05-08 13:23 | Estimated read 7 min

Section 01

[Introduction] CAIS Framework: Compute-Aware In-Switch Computing Solution for Tensor Parallelism in Large Models

This article introduces CAIS (Compute-Aware In-Switch Computing), a framework that targets the isolation between computation and communication in tensor parallelism on multi-GPU systems. It combines three techniques: a compute-aware ISA extension, merge-aware thread block coordination, and a graph-level dataflow optimizer. Together they deliver an average 1.38x end-to-end training speedup over NVLS and suggest a new design paradigm for large-scale AI infrastructure.

Section 02

Background: Communication Bottlenecks in Large Model Tensor Parallelism and Limitations of Existing Solutions

As large language models (LLMs) grow, a single GPU can no longer meet their memory and compute demands, and tensor parallelism (TP) has become a core strategy for distributed training. TP, however, relies on frequent collective communication operations, which have become a performance bottleneck. NVLink SHARP (NVLS) accelerates these collectives with in-switch computing, but its communication-centric design is fundamentally mismatched with the memory semantics of LLM computation kernels. As a result, computation and communication run as isolated phases, resources are underutilized, and there is little room to overlap the two.
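To make the bottleneck concrete, here is a minimal sketch of a row-parallel linear layer, a common way to shard a matrix multiply under tensor parallelism. It is a pure-Python toy with simulated GPUs, and the function names (`row_parallel_linear`, `all_reduce_sum`) are illustrative assumptions, not part of CAIS or NVLS. Each rank computes only a partial result, so the layer's output does not exist until the all-reduce finishes.

```python
# Toy model of a row-parallel linear layer: Y = X @ W, with the inner
# dimension split across `tp` simulated GPUs. Pure Python, no real GPUs.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def all_reduce_sum(partials):
    """Element-wise sum of one partial matrix per GPU (the collective step)."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]

def row_parallel_linear(x, w, tp):
    """Shard the inner dimension across `tp` GPUs, then all-reduce the partials."""
    k = len(w)
    assert k % tp == 0, "inner dimension must divide evenly across GPUs"
    shard = k // tp
    partials = []
    for rank in range(tp):
        lo, hi = rank * shard, (rank + 1) * shard
        x_shard = [row[lo:hi] for row in x]        # columns of X owned by this rank
        w_shard = w[lo:hi]                         # rows of W owned by this rank
        partials.append(matmul(x_shard, w_shard))  # local compute, incomplete result
    return all_reduce_sum(partials)                # communication before the next layer

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [2, 1], [1, 2]]
assert row_parallel_linear(x, w, tp=2) == matmul(x, w)
```

In a real transformer this pattern repeats in every layer, which is why the cost of the collective matters so much for end-to-end training time.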

Section 03

Technology 1: Compute-Aware ISA and Switch Microarchitecture Extension

CAIS defines a compute-aware instruction set architecture (ISA) and extends the switch microarchitecture to support it. Where a traditional switch only forwards traffic, a CAIS switch understands the memory access patterns of computation tasks (for example reads, writes, and atomic operations) and uses them to optimize data flow. The extended microarchitecture adds a dedicated compute-aware scheduling unit that adjusts communication strategy dynamically according to GPU computation status, so that data arrives when the computation needs it.
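The article does not describe the scheduling unit's actual policy. As a loose illustration of "data arrives when the computation needs it", the sketch below swaps in a plain earliest-deadline-first ordering on a single link. The function `order_transfers`, the transfer names, and all timings are hypothetical and are not taken from CAIS.

```python
import heapq
from typing import List, Tuple

def order_transfers(transfers: List[Tuple[str, float, float]]) -> List[Tuple[str, float, bool]]:
    """Serve transfers earliest-deadline-first on one link.

    Each transfer is (name, needed_at, duration): `needed_at` is when the
    consuming kernel is expected to read the data, `duration` is the link time.
    Returns (name, finish_time, on_time) in service order.
    """
    heap = [(needed_at, name, duration) for name, needed_at, duration in transfers]
    heapq.heapify(heap)
    now, schedule = 0.0, []
    while heap:
        needed_at, name, duration = heapq.heappop(heap)
        now += duration
        schedule.append((name, now, now <= needed_at))
    return schedule

# Hypothetical pending transfers, listed in arrival order.
pending = [("grad_block_2", 9.0, 3.0), ("act_block_0", 2.0, 2.0), ("act_block_1", 5.0, 2.0)]
for name, finish, on_time in order_transfers(pending):
    print(f"{name}: done at {finish}, on time: {on_time}")
```

Served in arrival order, the two activation transfers would finish at 5.0 and 7.0, after their consumers need them. Ordering by need time lets all three arrive on time in this toy case.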

Section 04

Technology 2: Merge-Aware Thread Block Coordination Mechanism

CAIS introduces a merge-aware thread block (TB) coordination mechanism that tracks the execution progress of each GPU thread block and identifies communication requests that can be merged. When multiple TBs need the same or adjacent data, the mechanism coordinates them to issue their requests at the same time, making full use of the switch's batch processing capability. The coordination is dynamic: it continuously monitors TB status, predicts upcoming communication needs, and adjusts scheduling to maximize merging opportunities.
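The merging idea can be sketched as interval coalescing over address ranges. This is a simplified illustration under assumed data structures: `Request`, `MergedRequest`, and `merge_requests` are hypothetical names, and the real mechanism also has to coordinate timing across thread blocks, which this sketch ignores.

```python
from typing import List, NamedTuple

class Request(NamedTuple):
    tb_id: int    # hypothetical thread-block identifier
    start: int    # first byte address touched
    length: int   # number of bytes

class MergedRequest(NamedTuple):
    start: int
    length: int
    tb_ids: tuple  # thread blocks served by this merged request

def merge_requests(requests: List[Request]) -> List[MergedRequest]:
    """Merge requests whose address ranges overlap or touch."""
    merged = []
    for req in sorted(requests, key=lambda r: r.start):
        end = req.start + req.length
        if merged and req.start <= merged[-1][1]:
            # Same or adjacent data: extend the open range and record the TB.
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2].append(req.tb_id)
        else:
            merged.append([req.start, end, [req.tb_id]])
    return [MergedRequest(s, e - s, tuple(ids)) for s, e, ids in merged]

pending = [
    Request(tb_id=0, start=0, length=128),
    Request(tb_id=1, start=128, length=128),   # adjacent to TB 0
    Request(tb_id=2, start=64, length=64),     # overlaps TB 0
    Request(tb_id=3, start=1024, length=128),  # unrelated region
]
for m in merge_requests(pending):
    print(m)
```

Here four requests collapse into two: one covering bytes 0 to 255 on behalf of three thread blocks, and one for the unrelated region. The fewer, larger requests are what a batch-oriented switch can process efficiently.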

Section 05

Technology 3: Graph-Level Dataflow Optimizer Enables Cross-Kernel Overlapping

CAIS's graph-level dataflow optimizer builds a global view of the dataflow, analyzes the data dependencies in the computation graph, and identifies opportunities for parallel execution. By prefetching data, delaying non-critical communication, and reordering operations, it achieves tight overlap across kernels. The optimization is tailored to tensor parallelism, exploiting the data locality of operations such as all-reduce to improve pipeline efficiency.
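A small scheduling model shows why a graph-level view pays off. The sketch below is an assumption-laden toy, not the CAIS optimizer: it treats compute and communication as two independent resources, uses made-up durations, and compares a fully serialized run with one where an op starts as soon as its dependencies and its resource allow.

```python
from typing import Dict, List, Tuple

# op name -> (resource, duration, dependencies); all values are made up.
Graph = Dict[str, Tuple[str, float, List[str]]]

def serialized_makespan(graph: Graph) -> float:
    """Compute and communication never overlap: durations simply add up."""
    return sum(duration for _, duration, _ in graph.values())

def overlapped_makespan(graph: Graph, order: List[str]) -> float:
    """Each resource runs one op at a time; different resources run concurrently.

    `order` must be a topological order of the graph. An op starts once its
    dependencies have finished and its resource is free.
    """
    finish: Dict[str, float] = {}
    free_at = {"compute": 0.0, "comm": 0.0}
    for op in order:
        resource, duration, deps = graph[op]
        ready = max((finish[d] for d in deps), default=0.0)
        start = max(ready, free_at[resource])
        finish[op] = start + duration
        free_at[resource] = finish[op]
    return max(finish.values())

# Two independent chunks of work: each is a GEMM followed by its all-reduce.
graph: Graph = {
    "gemm_a":      ("compute", 4.0, []),
    "allreduce_a": ("comm",    3.0, ["gemm_a"]),
    "gemm_b":      ("compute", 4.0, []),
    "allreduce_b": ("comm",    3.0, ["gemm_b"]),
}
order = ["gemm_a", "gemm_b", "allreduce_a", "allreduce_b"]
print(serialized_makespan(graph), overlapped_makespan(graph, order))
```

In this toy graph the all-reduce of the first chunk hides entirely behind the second GEMM, cutting the makespan from 14 to 11 time units. Only the final all-reduce stays exposed, which is the kind of cross-kernel overlap the section describes.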

Section 06

Experimental Results: CAIS Achieves 1.38x Training Speedup

On mainstream LLM workloads, CAIS achieves an average 1.38x end-to-end training speedup over the state-of-the-art NVLS solution and a 1.61x speedup over T3, a computation-communication overlapping solution that does not use NVLS. The results indicate that computation and communication need to be optimized together: by removing the isolation between them that traditional architectures impose, CAIS unlocks more of the performance of multi-GPU systems.
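To put the speedup factors in terms of wall-clock time, the short calculation below converts each factor into the fraction of training time saved. It assumes the reported averages apply uniformly to a given run, which the article does not claim; the helper name `time_saved_fraction` is introduced here for illustration.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time removed by a given speedup factor."""
    return 1.0 - 1.0 / speedup

for baseline, speedup in [("NVLS", 1.38), ("T3", 1.61)]:
    print(f"vs {baseline}: {time_saved_fraction(speedup):.1%} less training time")
```

Under that assumption, a 1.38x speedup corresponds to roughly 27.5% less training time than NVLS, and 1.61x to roughly 37.9% less than T3.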

Section 07

Practical Significance and Future Outlook: A New Paradigm of Computation-Network Convergence

CAIS is a useful reference for building large-scale AI infrastructure: it demonstrates a design paradigm in which network devices are active participants in the computation ecosystem. Future switches may integrate more computing capability, and the convergence of computation and networking is likely to be an important trend in next-generation AI infrastructure. In conclusion, the 1.38x speedup from CAIS can lower training costs or make room for larger models, a significant cost-effectiveness gain for AI clusters.
