RBG: An LLM Inference Service Orchestration Framework for Kubernetes

RBG (RoleBasedGroup) is a Kubernetes API specifically designed for orchestrating distributed, stateful AI inference workloads. It supports multi-role collaboration and built-in service discovery, making it particularly suitable for production deployment of decoupled architectures such as Prefill/Decode separation.

Tags: Kubernetes · LLM Inference · Cloud-Native · AI Infrastructure · Distributed Systems
Published 2026-04-07 14:11 · Recent activity 2026-04-07 16:10 · Estimated read: 7 min

Section 01

Introduction

RBG (RoleBasedGroup) is a Kubernetes API designed specifically for orchestrating distributed, stateful AI inference workloads. It supports multi-role collaboration and built-in service discovery, making it particularly well suited to production deployments of decoupled architectures such as Prefill/Decode separation. Through a role-based organizational abstraction, it addresses the weaknesses of traditional Kubernetes primitives in multi-role topology management, hardware-topology awareness, and atomic cross-role operations, providing a unified orchestration view and efficient collaboration for LLM inference services.


Section 02

Background: Limitations of Traditional Kubernetes Primitives

Modern high-performance LLM inference systems increasingly adopt decoupled architectures (e.g., Prefill/Decode separation), producing complex topologies with multiple roles such as Gateway and Router. Traditional Kubernetes workload resources (StatefulSet, Deployment), however, face the following challenges:

  1. Fragmented multi-role topology management: each role must be managed as a separate resource, leaving no unified orchestration view;
  2. Hardware-topology insensitivity: native scheduling makes it hard to fully exploit interconnect features such as NVLink and PCIe;
  3. Lack of atomic operations: cross-role operations such as deployment and upgrade are uncoordinated, which easily leads to service interruptions or inconsistent state.

Section 03

Core Concept of RBG: Role-Based Organizational Abstraction

RBG views inference services as role-based organizations. Its core concepts include:

  • Role: The basic scheduling unit. Each role (e.g., Prefill, Decode) has independent specifications, lifecycle, and policies, and relationships between roles are configurable;
  • RoleBasedGroup: A set of roles forming one logical service, managed as an integrated unit with topology, statefulness, and collaboration semantics, rather than as a collection of isolated resources.
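These concepts can be pictured as a single manifest declaring the whole group. The sketch below is illustrative only: the apiVersion, field names, and images are assumptions based on the concepts above, not the authoritative RBG schema.

```yaml
# Hypothetical RoleBasedGroup with two roles; field names are
# illustrative assumptions, not the official schema.
apiVersion: workloads.x-k8s.io/v1alpha1   # assumed API group/version
kind: RoleBasedGroup
metadata:
  name: llm-inference
spec:
  roles:
    - name: prefill                # role: the basic scheduling unit
      replicas: 2
      template:                    # a standard Pod template per role
        spec:
          containers:
            - name: prefill
              image: example.com/prefill-server:latest   # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 8
    - name: decode
      replicas: 4
      template:
        spec:
          containers:
            - name: decode
              image: example.com/decode-server:latest    # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 8
```

The key point is that the controller treats the whole group, not each role, as the unit of deployment, upgrade, and recovery.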

Section 04

Five Core Capabilities of RBG (SCOPE)

RBG builds five core capabilities (SCOPE):

  1. Topology-aware deterministic operations: upgrades and scaling are precisely scoped through RoleID injection and a minimal-replacement-domain principle;
  2. Cross-role policy engine: supports deployment pairing, coordinated upgrades, linked recovery, and coordinated scaling;
  3. Role dependency management: defines role dependencies and startup order (e.g., Decode starts only after Prefill is ready);
  4. Topology self-aware service discovery: topology information is injected into Pods, eliminating external registry dependencies;
  5. Topology-aware placement: scheduling considers hardware affinity (GPU-NVLink > PCIe > RDMA > VPC) as well as role affinity.
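Capabilities 3 and 4 above can be sketched as extra per-role configuration plus injected Pod environment. The field and variable names below are illustrative assumptions, not the authoritative RBG schema:

```yaml
# Illustrative sketch only; field names are assumptions.
spec:
  roles:
    - name: prefill
      replicas: 2
    - name: decode
      replicas: 4
      dependencies:      # capability 3: the decode role starts only
        - prefill        # after the prefill role reports ready
# Capability 4: each Pod could then see injected topology metadata, e.g.
#   ROLE_NAME=decode
#   ROLE_ID=decode-1
#   GROUP_NAME=llm-inference
# so peers within the group are discoverable without an external registry.
```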

Section 05

Typical Application Scenarios of RBG

RBG is particularly suitable for the following scenarios:

  • Large-scale production deployment: manage tens to hundreds of GPU instances while reducing operational complexity;
  • Decoupled architectures: support advanced architectures such as Prefill/Decode separation and speculative decoding;
  • Multi-tenant environments: cleanly partition and isolate resources for different models or user groups;
  • Hybrid-cloud deployment: optimize traffic routing and failover across availability zones and cloud providers.

Section 06

Version Compatibility and Ecosystem

RBG is compatible with the Kubernetes ecosystem:

  RBG Version | Kubernetes Version | LeaderWorkerSet Version
  main        | >=v1.28.x          | >=v0.7.0
  v0.4.0      | >=v1.28.x          | >=v0.7.0
  v0.3.0      | >=v1.28.x          | >=v0.6.0
The project reuses LeaderWorkerSet code, follows Kubernetes community practices, and adopts an open governance model.

Section 07

Conclusion and Recommendations

RBG represents a significant advancement in AI inference orchestration on Kubernetes, addressing the core shortcomings of traditional workload primitives. As LLM inference scales up and architectures grow more complex, RBG is well positioned to become a standard component of production environments. Teams building or expanding LLM inference infrastructure are encouraged to evaluate RBG carefully and adopt it where it fits.