Zing Forum

KubeRay Batch Inference: Production-Grade Distributed LLM Offline Inference Solution on Kubernetes

This is a production-grade reference implementation of a distributed offline large-model batch inference service built on KubeRay, Ray Data, and FastAPI. The project fully demonstrates how to deploy a scalable LLM inference service in a Kubernetes environment, supporting an OpenAI-compatible Batches API and implementing complete authentication, state management, and streaming result delivery.

Tags: KubeRay, Ray, distributed inference, batch processing, LLM, Kubernetes, FastAPI, Qwen, large model deployment, cloud-native AI
Published 2026-04-15 12:44 · Recent activity 2026-04-15 12:50 · Estimated read: 7 min

Section 01

KubeRay Batch Inference: Production-Grade Distributed LLM Offline Inference Solution on Kubernetes

This post introduces KubeRay Batch Inference, a production-grade reference implementation of a distributed offline LLM batch inference service built on KubeRay, Ray Data, and FastAPI. It supports an OpenAI-compatible Batches API, complete authentication, state management, and streaming result delivery, addressing the engineering challenges of large-scale offline inference in enterprise settings.

Section 02

Background: Engineering Challenges of LLM Batch Inference

As LLMs see widespread adoption, enterprises face the challenge of running large-scale offline inference efficiently. Unlike real-time online inference, batch inference must handle massive data volumes, demanding higher throughput, better resource utilization, and tighter cost control. Traditional single-machine solutions hit bottlenecks as data volume surges, while naive distributed solutions lack the stability, observability, and ease of use required in production. This project aims to solve these pain points.

Section 03

Project Overview: Technical Stack & Core Features

KubeRay Batch Inference uses Ray, managed on Kubernetes via KubeRay, as the distributed computing framework, Ray Data for streaming data processing, and FastAPI for the RESTful API layer, targeting Alibaba Cloud's Qwen2.5-0.5B-Instruct model. Core features include: an OpenAI-compatible Batches API (low migration cost), a KubeRay-based elastic distributed architecture (horizontal scaling), static API Key authentication, PostgreSQL for state persistence, application/x-ndjson streaming results, and production-grade quality (169 test cases, 100% code coverage, full CI/CD).
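The application/x-ndjson streaming format mentioned above is simply one JSON object per line. A minimal sketch of serializing and parsing such a result stream (the field names here are illustrative, not taken from the project):

```python
import json


def to_ndjson(records):
    """Serialize an iterable of result dicts as application/x-ndjson:
    one JSON object per line, newline-terminated."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def from_ndjson(payload):
    """Parse an NDJSON payload back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


results = [
    {"custom_id": "req-1", "response": {"status_code": 200, "body": {"text": "positive"}}},
    {"custom_id": "req-2", "response": {"status_code": 200, "body": {"text": "negative"}}},
]
payload = to_ndjson(results)
assert from_ndjson(payload) == results
```

Because each line is independently parseable, a client can consume results incrementally as they arrive rather than waiting for the full response body.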

Section 04

Architecture Design: End-to-End Workflow

The system runs on a Kubernetes cluster with four core components: a FastAPI proxy (authentication, validation, job submission/query), PostgreSQL (job metadata storage), a KubeRay-managed RayCluster (distributed inference), and a shared PVC (input/output data access). Request flow:

1. Authentication and validation;
2. Job registration (queued state);
3. Submission to the RayCluster via the Ray Jobs API;
4. Parallel processing with Ray Data's map_batches;
5. Results written to the PVC as JSONL;
6. An async poller updates the job state (5-second interval);
7. The client streams the results.

Job states: queued → in_progress → completed/failed/cancelled.

Section 05

Technical Implementation Highlights

Key implementations:

1. Model containerization: a multi-stage Docker build packs the Qwen2.5-0.5B-Instruct weights into the Ray worker image (no runtime model download, fast startup, version consistency).
2. Security: a static API Key passed via the X-API-Key header, compared with hmac.compare_digest to prevent timing attacks; unauthorized requests receive 401.
3. Observability: Prometheus metrics (HTTP requests, job stats), JSON-structured logs (request_id, batch_id), X-Request-ID for tracing.
4. Testing: 169 test cases (100% coverage); CI runs ruff, mypy, pytest, and kubeconform; end-to-end tests on a kind + KubeRay cluster.

Section 06

Deployment & Usage Guide

Environment requirements: Ubuntu 22.04/24.04, WSL2, or macOS; 8+ CPU cores and 16 GB+ memory; Docker, kind, kubectl, Helm, Python 3.11+. Quick start: bash scripts/setup.sh installs dependencies; make up creates the kind cluster and deploys the KubeRay Operator, a RayCluster (1 head + 2 workers), PostgreSQL, the PVC, the FastAPI service, and port forwarding. API usage: submit a batch task via curl (with model, input, max_tokens), query its state, and stream the results (NDJSON).
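A minimal sketch of the kind of request body the submit step sends. The exact schema depends on the project's API; the field names below follow the OpenAI Batches style the post says it is compatible with, and the endpoint path is an assumption:

```python
import json

# Illustrative batch submission payload (endpoint assumed: POST /v1/batches).
request = {
    "model": "Qwen2.5-0.5B-Instruct",
    "input": [
        {
            "custom_id": "req-1",
            "body": {
                "messages": [{"role": "user", "content": "Classify: great product!"}],
                "max_tokens": 32,
            },
        },
    ],
}

# Serialize for the HTTP request; the X-API-Key header carries the static key.
body = json.dumps(request)
headers = {"Content-Type": "application/json", "X-API-Key": "sk-example-key"}

parsed = json.loads(body)
assert parsed["model"] == "Qwen2.5-0.5B-Instruct"
assert parsed["input"][0]["body"]["max_tokens"] == 32
```

The same payload could be sent with curl -H "X-API-Key: …" -d @body.json against the port-forwarded FastAPI service.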

Section 07

Application Scenarios & Core Value

Applicable scenarios: Large-scale text processing (classification, sentiment analysis), content generation (marketing copy, product descriptions), data annotation (automated preprocessing), model evaluation (batch testing). Core value: Production-grade reference (ready for production), cloud-native AI best practices, learning resource (detailed docs/annotations), extensible foundation (modular design for customization).

Section 08

Summary & Outlook

KubeRay Batch Inference provides a complete, production-ready distributed LLM batch inference solution, integrating cutting-edge AI with cloud-native infrastructure. It serves both as a usable technical solution and as a learning resource for teams exploring LLM applications. As LLM technology evolves, distributed inference infrastructure of this kind will only grow in importance, and this project offers a valuable implementation paradigm.