Zing Forum

Reading

SCIN: Switch-Centric In-Network Computing Architecture Accelerates Large Model Inference

This paper proposes the SCIN architecture, in which in-switch accelerators directly initiate memory-semantic operations, eliminating the data-return overhead of NVLink SHARP and supporting in-network quantization. On the LLaMA-2 model, it achieves a 1.74x speedup in TTFT, a 1.34x speedup in TPOT, and up to 8.7x acceleration for All-Reduce operations.

In-Network Computing · All-Reduce Optimization · Large Model Inference · Switch Architecture · Quantized Communication · Distributed AI
Published 2026-03-30 17:59 · Recent activity 2026-04-01 10:25 · Estimated read 5 min

Section 01

[Introduction] SCIN: Switch-Centric In-Network Computing Architecture Accelerates Large Model Inference

This paper proposes SCIN (Switch-Centric In-Network Computing Architecture) to address the communication bottleneck in distributed inference of large models. Its core innovation is making the switch an active computing initiator: by integrating in-switch accelerators (ISAs), it eliminates the data-return overhead of NVLink SHARP and supports in-network quantization. Experiments on the LLaMA-2 model show a 1.74x speedup in TTFT, a 1.34x speedup in TPOT, and up to 8.7x acceleration for All-Reduce operations.


Section 02

[Background] Communication Bottlenecks in Large Model Inference and Limitations of NVLink Sharp

As model sizes grow, distributed inference becomes unavoidable, and communication overhead is a key bottleneck. All-Reduce operations account for a large share of communication time in Transformer inference; traditional solutions execute the reduction on GPUs, forcing data round trips over the interconnect. Although NVLink SHARP (NVLS) performs in-network reduction, it has two major limitations: (1) redundant data return (the reduced result must first travel back to the source GPU before being broadcast); (2) limited operation types (only simple memory-semantic instructions are supported, so optimizations such as in-network quantization cannot be implemented).
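The return-path limitation above can be made concrete with a toy hop-count model (my own illustration, not from the paper): count the link traversals the reduced result makes on the critical path before every GPU holds it.

```python
# Toy hop-count model of one All-Reduce result (illustrative sketch, not
# the paper's measurement methodology).

def result_path_hops(direct_broadcast: bool) -> int:
    """Critical-path link traversals for the reduced All-Reduce result."""
    hops = 1  # GPUs push operands up to the switch, where reduction happens
    if direct_broadcast:
        hops += 1  # switch-centric: broadcast the result straight down
    else:
        # NVLS-style: result returns to the source GPU, goes back up to the
        # switch, and only then is broadcast down to all GPUs.
        hops += 3
    return hops

assert result_path_hops(direct_broadcast=True) == 2   # SCIN-style flow
assert result_path_hops(direct_broadcast=False) == 4  # flow with data return
```

In this simplified model, removing the return leg halves the critical-path hops, which is the intuition behind SCIN's latency advantage on small messages.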


Section 03

[Methodology] Core of SCIN Architecture: Switch Active Computing and In-Network Quantization

SCIN adopts a switch-centric architecture in which the switch actively initiates computation rather than passively executing it. The key component, the in-switch accelerator (ISA), provides active memory operations, flexible compute support, and direct broadcasting. Technical innovations include:

  1. In-network quantization: after reduction, 16-bit data is quantized to 8-bit, saving 50% of bandwidth while maintaining precision;
  2. Latency optimization: eliminating the return leg, streamlining protocol headers, and adding hardware acceleration speeds up small-message All-Reduce by 8.7x;
  3. Bandwidth optimization: quantization accelerates large-message All-Reduce by 3.8x.
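The in-network quantization step can be sketched in a few lines (an illustrative symmetric int8 scheme with a per-tensor scale; the paper's actual quantizer may differ):

```python
# Illustrative symmetric int8 quantization of a reduced vector before the
# broadcast leg. Function names and the scheme are assumptions for the
# sketch, not the paper's API.

def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> (int8 list, scale)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

reduced = [0.5, -1.25, 3.0, -0.75]  # stand-in for a reduced partial sum
q, s = quantize_int8(reduced)
restored = dequantize_int8(q, s)

# Each element's error is bounded by half a quantization step (~scale/2).
assert all(abs(r - v) <= s for r, v in zip(restored, reduced))

# 16-bit elements shrink to 8-bit on the wire: half the broadcast bytes.
bytes_fp16 = 2 * len(reduced)
bytes_int8 = 1 * len(reduced)
assert bytes_int8 * 2 == bytes_fp16
```

The 50% bandwidth saving applies to the broadcast leg, which is why the gain is largest for bandwidth-bound (large-message) All-Reduce.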

Section 04

[Evidence] Experimental Verification: Performance Improvements of SCIN on LLaMA-2 and All-Reduce

The research team verified SCIN on a multi-FPGA system:

  • Hardware platform: FPGA switch + simulated AI accelerator + high-speed links;
  • End-to-end results on LLaMA-2: 1.74x TTFT improvement (from optimizing prefill activation synchronization), 1.34x TPOT improvement (from optimizing decode-phase KV Cache synchronization);
  • All-Reduce micro-benchmarks: small messages (<1KB) 8.7x, medium messages (1KB-1MB) 4.2x, large messages (>1MB) 3.8x;
  • Architecture comparison: SCIN outperforms NVLS in control flow (switch active), data flow (direct broadcast), and operation support (programmable) aspects.

Section 05

[Conclusion and Outlook] Technical Significance, Limitations, and Future Directions of SCIN

SCIN advances the evolution of AI network architecture: from general-purpose to specialized, from passive to active, and from exact to approximate. Future directions include scaling to larger systems, richer in-network operations, and algorithm co-design; its path to industrialization aligns with trends toward custom silicon, network offloading, and quantized communication. Current limitations: the FPGA prototype awaits ASIC validation, quantization precision needs more comprehensive evaluation, and the balance between ISA programmability and performance needs further tuning. In short, SCIN unlocks performance potential through architectural innovation, offering a compelling option for large-model inference infrastructure.