Zing Forum

Reading

SCIN: Switch-Centric In-Network Computing Architecture Accelerates Large Model Inference

This paper proposes the SCIN architecture, in which in-switch accelerators directly initiate memory-semantic operations, eliminating the data-return overhead of NVLink SHARP and supporting in-network quantization. On the LLaMA-2 model, it achieves a 1.74x speedup in TTFT, a 1.34x speedup in TPOT, and up to 8.7x acceleration for All-Reduce operations.

In-Network Computing · All-Reduce Optimization · Large Model Inference · Switch Architecture · Quantized Communication · Distributed AI
Published 2026-03-30 17:59 · Recent activity 2026-04-01 10:25 · Estimated read 5 min

Section 01

[Introduction] SCIN: Switch-Centric In-Network Computing Architecture Accelerates Large Model Inference

This paper proposes SCIN (Switch-Centric In-Network Computing Architecture) to address the communication bottleneck in distributed inference of large models. Its core innovation is making the switch an active computing initiator: by integrating in-switch accelerators (ISAs), it eliminates the data-return overhead of NVLink SHARP and supports in-network quantization. Experiments on the LLaMA-2 model show a 1.74x speedup in TTFT, a 1.34x speedup in TPOT, and up to 8.7x acceleration for All-Reduce operations.


Section 02

[Background] Communication Bottlenecks in Large Model Inference and Limitations of NVLink Sharp

As model sizes grow, distributed inference becomes unavoidable, and communication overhead is a key bottleneck. All-Reduce operations account for a large share of communication time in Transformer inference; traditional solutions execute the reduction on GPUs, forcing data round trips over the interconnect. Although NVLink SHARP (NVLS) performs in-network reduction, it has two major limitations: (1) redundant data return (the reduced result must first travel back to the source GPU before being broadcast); (2) limited operation types (only simple memory-semantic instructions are supported, so optimizations such as in-network quantization cannot be implemented).
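The return-path limitation above can be made concrete with a toy hop-count model (my own illustration, not from the paper): count the link traversals the reduced result makes on the critical path before every GPU holds it.

```python
# Toy hop-count model of one All-Reduce result (illustrative sketch, not
# the paper's measurement methodology).

def result_path_hops(direct_broadcast: bool) -> int:
    """Critical-path link traversals for the reduced All-Reduce result."""
    hops = 1  # GPUs push operands up to the switch, where reduction happens
    if direct_broadcast:
        hops += 1  # switch-centric: broadcast the result straight down
    else:
        # NVLS-style: result returns to the source GPU, goes back up to the
        # switch, and only then is broadcast down to all GPUs.
        hops += 3
    return hops

assert result_path_hops(direct_broadcast=True) == 2   # SCIN-style flow
assert result_path_hops(direct_broadcast=False) == 4  # flow with data return
```

In this simplified model, removing the return leg halves the critical-path hops, which is the intuition behind SCIN's latency advantage on small messages.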


Section 03

[Methodology] Core of SCIN Architecture: Switch Active Computing and In-Network Quantization

SCIN adopts a switch-centric architecture in which the switch actively initiates computation rather than passively executing it. The key component, the in-switch accelerator (ISA), provides active memory operations, flexible compute support, and direct broadcasting. Technical innovations include:

  1. In-network quantization: after reduction, 16-bit data is quantized to 8-bit, saving 50% of bandwidth while maintaining precision;
  2. Latency optimization: eliminating the return leg, streamlining protocol headers, and adding hardware acceleration speeds up small-message All-Reduce by 8.7x;
  3. Bandwidth optimization: quantization accelerates large-message All-Reduce by 3.8x.
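The in-network quantization step can be sketched in a few lines (an illustrative symmetric int8 scheme with a per-tensor scale; the paper's actual quantizer may differ):

```python
# Illustrative symmetric int8 quantization of a reduced vector before the
# broadcast leg. Function names and the scheme are assumptions for the
# sketch, not the paper's API.

def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> (int8 list, scale)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

reduced = [0.5, -1.25, 3.0, -0.75]  # stand-in for a reduced partial sum
q, s = quantize_int8(reduced)
restored = dequantize_int8(q, s)

# Each element's error is bounded by half a quantization step (~scale/2).
assert all(abs(r - v) <= s for r, v in zip(restored, reduced))

# 16-bit elements shrink to 8-bit on the wire: half the broadcast bytes.
bytes_fp16 = 2 * len(reduced)
bytes_int8 = 1 * len(reduced)
assert bytes_int8 * 2 == bytes_fp16
```

The 50% bandwidth saving applies to the broadcast leg, which is why the gain is largest for bandwidth-bound (large-message) All-Reduce.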

Section 04

[Evidence] Experimental Verification: Performance Improvements of SCIN on LLaMA-2 and All-Reduce

The research team verified SCIN on a multi-FPGA system:

  • Hardware platform: FPGA switch + simulated AI accelerator + high-speed links;
  • End-to-end results on LLaMA-2: 1.74x TTFT improvement (from optimizing prefill activation synchronization), 1.34x TPOT improvement (from optimizing decode-phase KV Cache synchronization);
  • All-Reduce micro-benchmarks: small messages (<1KB) 8.7x, medium messages (1KB-1MB) 4.2x, large messages (>1MB) 3.8x;
  • Architecture comparison: SCIN outperforms NVLS in control flow (switch active), data flow (direct broadcast), and operation support (programmable) aspects.

Section 05

[Conclusion and Outlook] Technical Significance, Limitations, and Future Directions of SCIN

SCIN advances the evolution of AI network architecture: from general-purpose to specialized, from passive to active, and from exact to approximate. Future directions include scaling to larger systems, richer in-network operations, and algorithm co-design; its path to industrialization aligns with trends toward custom silicon, network offloading, and quantized communication. Current limitations: the FPGA prototype awaits ASIC validation, quantization precision needs more comprehensive evaluation, and the balance between ISA programmability and performance needs further tuning. In short, SCIN unlocks performance potential through architectural innovation, offering a compelling option for large-model inference infrastructure.