Zing Forum

Awex: A Reinforcement Learning Training and Inference Framework Enabling Second-level Weight Synchronization for Trillion-Parameter Models

Awex is an open-source, high-performance reinforcement learning weight synchronization framework developed by InclusionAI. It can fully synchronize the weights of a trillion-parameter model in about 10 seconds on a 1,000-GPU cluster, addressing the parameter-update latency between the training and inference sides of RLHF training.

Reinforcement Learning · RLHF · Weight Synchronization · Large Language Models · Distributed Training · NCCL · RDMA · Megatron · vLLM · Inference Optimization
Published 2026-04-10 21:40 · Recent activity 2026-04-10 21:45 · Estimated read: 7 min

Section 01

[OP / Introduction] Awex: An RL Training and Inference Framework Enabling Second-level Weight Synchronization for Trillion-Parameter Models

Awex is an open-source, high-performance reinforcement learning weight synchronization framework developed by InclusionAI. Its core goal is to eliminate the parameter-update latency between the training and inference sides in reinforcement learning pipelines such as RLHF. The framework has been validated on a 1,000-GPU cluster, fully synchronizing the weights of a trillion-parameter model in about 10 seconds, and provides efficient train-infer collaboration for large-scale reinforcement learning training.


Section 02

Background: Weight Synchronization Bottleneck in RL Training

In the reinforcement learning training of large language models (RLHF, DPO, etc.), traditional weight synchronization first writes the weights to a storage system and then has the inference side load them back, which takes minutes or longer. This latency severely limits algorithm iteration efficiency; in online RL scenarios in particular, where the inference side must frequently generate responses with the latest model, the synchronization bottleneck noticeably reduces training throughput and convergence speed.
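A rough back-of-envelope calculation shows why the checkpoint path is slow. The numbers below are illustrative assumptions (bf16 weights, a shared filesystem at ~5 GB/s, ~25 GB/s RDMA per GPU), not measurements from the Awex project:

```python
# Back-of-envelope: checkpoint-based sync vs. direct GPU-to-GPU transfer.
PARAMS = 1e12            # trillion-parameter model
BYTES_PER_PARAM = 2      # bf16
total_bytes = PARAMS * BYTES_PER_PARAM          # 2 TB of weights

storage_bw = 5e9         # ~5 GB/s to a shared filesystem (assumed)
rdma_bw_per_gpu = 25e9   # ~25 GB/s per GPU over RDMA (assumed)
num_gpus = 1000

# Checkpoint path: write the weights once, then read them back on load.
checkpoint_seconds = 2 * total_bytes / storage_bw

# Direct path: shards move in parallel across all GPU pairs.
direct_seconds = total_bytes / (rdma_bw_per_gpu * num_gpus)

print(f"checkpoint round-trip: ~{checkpoint_seconds:.0f} s")
print(f"direct P2P transfer:  ~{direct_seconds:.2f} s")
```

The idealized direct-transfer figure is far below Awex's reported single-digit seconds; the gap is where metadata exchange, layout conversion, and scheduling overhead go, but either way it is orders of magnitude faster than round-tripping through storage.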


Section 03

Core Technical Features of Awex

Awex's core technical features include:

  1. Extreme Synchronization Speed: in NCCL mode, a 10-billion-parameter model synchronizes in 0.8 s and a trillion-parameter model in 20 s; in RDMA mode, a trillion-parameter model takes only 6 s;
  2. Unified Weight Adaptation Layer: automatically bridges the parallel-strategy and tensor-layout differences between training engines (e.g., Megatron) and inference engines (e.g., vLLM);
  3. Zero-Redundancy Transmission & In-Place Update: only the necessary weight shards are transmitted, and the inference side updates GPU memory in place, avoiding extra allocation overhead;
  4. Multi-Mode Transmission Support: works over high-speed interconnects such as NCCL, RDMA, and shared memory;
  5. Heterogeneous Deployment Compatibility: supports co-located and separated deployment, fitting both synchronous and asynchronous RL algorithms.
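The in-place update idea in point 3 can be illustrated with a small stdlib-only sketch. The buffer and offsets below are stand-ins for a resident GPU tensor, not Awex's actual API:

```python
# In-place update: copy each received shard into a preallocated buffer
# instead of allocating a fresh tensor, avoiding extra memory churn.
gpu_weights = bytearray(1024)            # stands in for a resident GPU tensor
view = memoryview(gpu_weights)           # writable window, no copy made

def apply_shard(offset: int, payload: bytes) -> None:
    """Overwrite the weight region [offset, offset + len) in place."""
    view[offset:offset + len(payload)] = payload

apply_shard(0, b"\x01" * 256)            # first shard arrives
apply_shard(256, b"\x02" * 256)          # second shard, different region
```

The key property is that `gpu_weights` keeps its identity and address across updates; in the real setting this is what lets the inference engine keep serving from the same tensors while new weights stream in.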

Section 04

Awex Architecture Design and Core Workflow

Architecture Components

  • WeightWriter: Training nodes collect weight shard metadata, convert formats, and build transmission plans;
  • WeightReader: Inference instances receive weight data and complete local updates;
  • MetaServer: Global metadata exchange and coordination hub.
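The MetaServer's coordination role can be sketched as a shard-metadata registry: writers register which slice of each tensor they own, and readers query the registry to plan transfers. Class and method names here are invented for illustration; the real Awex interface will differ:

```python
# Toy MetaServer: WeightWriters register shard metadata, WeightReaders
# query it to learn which rank owns which slice of each tensor.
from collections import defaultdict

class MetaServer:
    def __init__(self):
        # tensor name -> list of (writer_rank, start, end) shard records
        self.shards = defaultdict(list)

    def register(self, tensor: str, rank: int, start: int, end: int) -> None:
        self.shards[tensor].append((rank, start, end))

    def lookup(self, tensor: str):
        # Return shards ordered by offset so readers can plan transfers.
        return sorted(self.shards[tensor], key=lambda s: s[1])

meta = MetaServer()
meta.register("layer0.qkv", rank=0, start=0, end=512)     # writer on rank 0
meta.register("layer0.qkv", rank=1, start=512, end=1024)  # writer on rank 1
print(meta.lookup("layer0.qkv"))  # → [(0, 0, 512), (1, 512, 1024)]
```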

Weight Exchange Workflow

  1. Unified format conversion: Convert weights from different engines (Megatron, vLLM, etc.) into a standard format;
  2. Global metadata exchange: Collect shard metadata and report to MetaServer;
  3. P2P transmission plan construction: Generate peer-to-peer transmission plans based on metadata;
  4. Transmission execution: Use NCCL/RDMA for data transmission;
  5. Tensor-level verification: compare the transmitted weights against weights loaded from file to confirm correctness.
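Step 3 above amounts to intersecting writer-side and reader-side shard ranges. A stdlib-only sketch of that plan construction (the ranks and ranges are hypothetical; the real planner must also account for tensor layout differences):

```python
# Build a peer-to-peer plan: for each reader shard, find the writer shards
# that overlap it and emit (src_rank, dst_rank, start, end) transfers.
def build_plan(writer_shards, reader_shards):
    plan = []
    for dst, r_start, r_end in reader_shards:
        for src, w_start, w_end in writer_shards:
            start, end = max(r_start, w_start), min(r_end, w_end)
            if start < end:                      # the ranges overlap
                plan.append((src, dst, start, end))
    return plan

# Training holds two shards; inference wants a different split of the tensor.
writers = [(0, 0, 600), (1, 600, 1024)]
readers = [(8, 0, 512), (9, 512, 1024)]
print(build_plan(writers, readers))
# → [(0, 8, 0, 512), (0, 9, 512, 600), (1, 9, 600, 1024)]
```

Because every reader fetches exactly the byte ranges it is missing and nothing more, this is also where the "zero-redundancy transmission" property comes from.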

Section 05

Performance Verification and Application Scenarios

Awex leads in benchmark tests and effectively removes the synchronization bottleneck. Applicable scenarios include:

  • Online RLHF training: frequently synchronizing the latest model to generate high-quality training data;
  • Multi-round iterative optimization: shortening training cycles in fast-iteration scenarios;
  • Large-scale cluster training: efficient collaboration at 1,000- to 10,000-GPU scale;
  • Real-time inference services: quickly rolling out the latest model version in production environments.

Section 06

Summary and Outlook

Awex solves the weight synchronization bottleneck in large-scale RL training through an innovative architecture and efficient transmission mechanisms. Its second-level synchronization capability makes online reinforcement learning training of trillion-parameter models practical, providing solid support for the continuous optimization of large language models. As model scale continues to grow, such specialized weight synchronization frameworks will play an increasingly important role in AI infrastructure.