Zing Forum


SCIN: Switch-Centric In-Network Computing Architecture for Large Model Inference

SCIN eliminates the redundant data transmission in NVLink Sharp through in-switch accelerators (ISA) and a co-designed communication architecture, achieving 8.7x acceleration for small-message All-Reduce and 3.8x for large-message All-Reduce, a 1.74x improvement in TTFT, and supporting in-network quantization (INQ) to reduce bandwidth requirements.

Tags: in-network computing, All-Reduce, switch-centric, LLM inference, quantization, NVLink, distributed training
Published 2026-03-30 17:59 · Recent activity 2026-03-31 11:28 · Estimated read: 6 min

Section 01

Key Points of the SCIN Architecture

SCIN (Switch-Centric In-Network Computing Architecture) is an in-network computing architecture for large model inference that aims to remove communication bottlenecks in distributed inference. Its core innovations are in-switch accelerators (ISA), a co-designed communication architecture, and support for in-network quantization (INQ). Together, these eliminate the redundant transmission of NVLink Sharp, achieve 8.7x acceleration for small-message All-Reduce and 3.8x for large-message All-Reduce, improve LLM inference TTFT by 1.74x, and reduce bandwidth requirements.


Section 02

Communication Bottlenecks in Large Model Inference and Limitations of Existing Technologies

Large-scale deployment of large model inference faces heavy communication overhead, and All-Reduce operations in distributed systems often become the performance bottleneck. Although the existing NVLink Sharp technology offloads All-Reduce to switches, it has two major limitations: first, it relies on GPUs to trigger the reduction, so the reduced data must be sent back to the source GPU before it is broadcast, creating redundant transmission; second, it cannot support non-memory-semantic operations (such as INQ), forcing operation at FP16/BF16 precision and wasting bandwidth.
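The redundant hop described above can be made concrete with a toy message trace. The node names and the per-phase breakdown below are illustrative assumptions, not SCIN's actual wire protocol:

```python
# Toy trace contrasting the two All-Reduce offload flows from the text.
# Node names ("gpu0", "switch") and the phase structure are assumptions
# made for illustration only.

def sharp_style_phases(n_gpus):
    """GPU-triggered reduction: the reduced result must return to the
    source GPU (phase 2) before that GPU broadcasts it (phase 3)."""
    return [
        [(f"gpu{i}", "switch") for i in range(n_gpus)],    # phase 1: upload
        [("switch", "gpu0")],                              # phase 2: redundant return hop
        [("gpu0", f"gpu{i}") for i in range(1, n_gpus)],   # phase 3: broadcast
    ]

def scin_phases(n_gpus):
    """Switch-centric flow: the in-switch accelerator broadcasts the
    reduced result directly, removing the return hop and a whole phase."""
    return [
        [(f"gpu{i}", "switch") for i in range(n_gpus)],    # phase 1: upload
        [("switch", f"gpu{i}") for i in range(n_gpus)],    # phase 2: direct broadcast
    ]

print(len(sharp_style_phases(8)), "vs", len(scin_phases(8)), "phases")
```

The trace shows the extra sequential phase (and the detour through one GPU's egress link) that a switch-initiated broadcast avoids.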


Section 03

Design of SCIN's Switch-Centric Architecture

SCIN proposes a switch-centric paradigm that upgrades switches from passive forwarding nodes to active computing participants. Key innovations:
1. In-switch Accelerator (ISA): actively initiates memory operations and broadcasts reduction results directly to target nodes, eliminating redundancy.
2. Co-designed communication architecture: pushes synchronization logic down into the hardware layer to reduce software overhead.
3. INQ support: the ISA integrates a quantization module that reduces precision to 8 bits, lowering bandwidth requirements with negligible precision loss.
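The INQ idea can be sketched numerically. This is a minimal sketch assuming symmetric per-tensor INT8 quantization with a wide accumulator inside the switch; the paper's actual quantization scheme and function names below are not specified in this summary:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor 8-bit quantization (done before transmission).
    The scale choice here is an assumed scheme, not SCIN's documented one."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def in_switch_all_reduce(tensors):
    """ISA-side sketch: receive INT8 payloads plus scales, accumulate in
    FP32 to avoid overflow, and return the reduced result for broadcast."""
    acc = np.zeros_like(tensors[0], dtype=np.float32)
    for t in tensors:
        q, s = quantize_int8(t)           # in reality done on the GPU side
        acc += q.astype(np.float32) * s   # dequantize-and-accumulate
    return acc

rng = np.random.default_rng(0)
xs = [rng.standard_normal(1024).astype(np.float32) for _ in range(8)]
exact = np.sum(xs, axis=0)
approx = in_switch_all_reduce(xs)
err = float(np.max(np.abs(approx - exact)))  # small quantization error
```

On the wire, the INT8 payload is half the size of the same tensor in FP16, which is where the bandwidth saving comes from.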


Section 04

SCIN Performance Optimization Mechanisms

SCIN optimizes performance through two mechanisms:
1. Eliminating redundant transmission: a single-hop mode broadcasts results directly from the switch after reduction, cutting the communication steps from 3 to 2 and lowering latency.
2. Improving bandwidth efficiency: INQ reduces precision to 8 bits, halving bandwidth requirements with negligible precision loss, well suited to large model parameter synchronization.
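A back-of-envelope alpha-beta cost model shows how the two mechanisms compound. The link latency and bandwidth below are assumed illustrative values, not measured SCIN numbers, and such a simple model captures only part of the reported gains:

```python
# Alpha-beta cost model: each sequential step pays a fixed latency plus
# the serialization time of its payload on the bottleneck link.
# alpha_us and bw_gbps are illustrative assumptions.

def allreduce_time_us(steps, payload_bytes, alpha_us=1.0, bw_gbps=400.0):
    serialize_us = payload_bytes * 8 / (bw_gbps * 1e3)  # bits / (bits per microsecond)
    return steps * (alpha_us + serialize_us)

payload = 1 << 20                                       # 1 MiB of FP16 data
baseline = allreduce_time_us(steps=3, payload_bytes=payload)
scin = allreduce_time_us(steps=2, payload_bytes=payload // 2)  # one fewer step, INQ halves the bytes
print(f"{baseline:.1f} us -> {scin:.1f} us ({baseline / scin:.2f}x)")
```

In this toy model, large messages benefit mainly from the halved payload, while small messages benefit from dropping a latency-bound step, consistent with the direction (though not the magnitude) of the reported speedups.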


Section 05

SCIN Experimental Validation and Performance Results

The research team implemented an SCIN prototype on a multi-FPGA system. Experimental results show: 8.7x acceleration for small-message All-Reduce and 3.8x for large-message All-Reduce; in end-to-end evaluation of the LLaMA-2 model, TTFT (Time To First Token) improved by 1.74x, and TPOT (Time Per Output Token) improved by 1.34x.


Section 06

Technical Significance and Industry Impact of SCIN

SCIN promotes the shift of network computing architecture from endpoint-centric to switch-centric, making switches active computing nodes. Industry implications include: 1. programmable networks, where switches integrate general computing capabilities; 2. precision-adaptive transmission, where network protocols natively support multiple precisions; 3. hardware-software co-design, optimizing every layer for AI workloads.


Section 07

Limitations and Future Directions of SCIN

Current limitations: the limited performance of the FPGA prototype, ecosystem-compatibility challenges, and the open question of how well the quantization strategy generalizes. Future directions: extending to more complex in-network operations (such as All-Gather), dynamic precision adjustment, and combining with optical network technology to further improve performance.