Zing Forum


Zero-Cost GPU Inference Platform: Elastic LLM Service Architecture Based on KEDA and Kubernetes

This article introduces a production-grade GPU inference platform that implements a true scale-to-zero architecture. Through KEDA's event-driven auto-scaling and Kubernetes Cluster Autoscaler's node-level elasticity, the platform incurs zero cost when idle and automatically wakes up GPU nodes for inference when requests arrive.

GPU Inference · Kubernetes · KEDA · Autoscaling · vLLM · Cost Optimization · Cloud Native · LLM Serving
Published 2026-04-06 23:38 · Recent activity 2026-04-06 23:49 · Estimated read 6 min

Section 01

Introduction: Core Value and Architecture Overview of the Zero-Cost GPU Inference Platform

This article introduces a production-grade GPU inference platform based on Kubernetes and KEDA, designed to solve the cost dilemma of LLM inference. The platform achieves true scale-to-zero through a two-layer elastic scaling architecture: both GPU nodes and Pods drop to zero when idle and automatically wake up when requests arrive. Core advantages include zero idle cost, automatic handling of burst traffic, and production-grade observability, providing a cost-effective, high-performance LLM serving option for teams with limited budgets.


Section 02

Background: Cost Dilemma and Ideal Requirements for GPU Inference

LLM inference services face a dilemma: permanently running GPU instances waste money while idle, while shutting down completely means enduring minute-level cold starts. An ideal solution should meet four requirements: zero cost when no requests are present, automatic and fast scaling when requests arrive, absorption of burst traffic without dropping requests, and production-grade observability and stability.


Section 03

Architecture Design: Two-Layer Elastic Strategy and Core Components

The platform adopts two-layer elastic scaling:

  1. Pod-level elasticity: KEDA automatically adjusts the number of Pod replicas (0 to N) based on Redis queue depth;
  2. Node-level elasticity: GKE Cluster Autoscaler automatically creates/recycles GPU nodes based on pending Pods.
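The Pod-level layer can be expressed as a KEDA ScaledObject watching Redis queue depth. The sketch below is illustrative only — the deployment name, Redis address, and queue name are assumed placeholders, not taken from the project:

```yaml
# Illustrative KEDA ScaledObject: scale the vLLM Deployment on Redis list length.
# Names (vllm-worker, redis.default.svc..., inference-tasks) are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-worker-scaler
spec:
  scaleTargetRef:
    name: vllm-worker          # Deployment running vLLM
  minReplicaCount: 0           # scale-to-zero when the queue is empty
  maxReplicaCount: 4
  cooldownPeriod: 300          # seconds of quiet before scaling back to zero
  triggers:
    - type: redis
      metadata:
        address: redis.default.svc.cluster.local:6379
        listName: inference-tasks
        listLength: "5"        # target pending tasks per replica
```

With `minReplicaCount: 0`, KEDA removes all replicas after the cooldown period; the next enqueued task recreates a Pod, and a pending GPU Pod is exactly what makes the Cluster Autoscaler provision a node, which is how the two layers chain together.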

Core components include:

  • API Gateway: FastAPI (asynchronous request access);
  • Message Queue: Redis (task buffering, result storage);
  • Inference Engine: vLLM (continuous batching, KV caching);
  • Monitoring: NVIDIA DCGM exporter (GPU metrics) and Grafana (visual dashboards).

Request flow: User request → FastAPI enqueues to Redis → KEDA triggers Pod scaling → Cluster Autoscaler starts GPU nodes → vLLM performs inference → Result returns to user.


Section 04

Cold Start Optimization: Key Strategies to Reduce Startup Time

Cold start is a core challenge for scale-to-zero. The platform optimizes this through the following strategies:

  1. Queue buffering: the Redis queue absorbs burst traffic so requests are not dropped while capacity spins up;
  2. Image pre-caching: GKE Secondary Boot Disk pre-stores container images to reduce pull time;
  3. Model weight persistence: PVC stores model weights to avoid repeated downloads.

After optimization, the cold start time is reduced from 9 minutes to about 5 minutes (node startup: ~2 minutes + model loading: ~2 minutes + Pod startup: ~30 seconds).
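Strategy 3 can be sketched as a PersistentVolumeClaim holding the downloaded weights; the claim name, size, and access mode below are illustrative assumptions (the workable access mode depends on the storage class):

```yaml
# Illustrative PVC for model weights (name, size, access mode are assumptions).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadOnlyMany"]   # lets several worker Pods share one copy
  resources:
    requests:
      storage: 50Gi
```

The worker Deployment then mounts `model-weights` at the path vLLM loads from, so a fresh Pod skips the multi-gigabyte download and only pays the load-into-GPU-memory time.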


Section 05

Cost Analysis: Data-Supported Value Verification

Cost structure in GCP environment:

  • Control plane: ~$0.10/hour (continuous);
  • GPU node (T4 spot): ~$0.15/hour (only incurred during inference);
  • Idle time: Zero cost for GPU nodes.

For intermittent loads, it can save 60-90% of costs compared to permanent GPU instances.
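Using the hourly rates above, the saving for a given duty cycle is straightforward to compute. The control-plane cost is identical in both scenarios, so it is excluded; the 2-hours-per-day load below is an assumed example workload, not a measured figure:

```python
GPU_RATE = 0.15  # $/hour, T4 spot (from the cost structure above)

def daily_gpu_cost(active_hours: float) -> float:
    """GPU cost per day when nodes run only while inference is active."""
    return GPU_RATE * active_hours

def gpu_savings_pct(active_hours: float) -> float:
    """Percent GPU-cost saving vs a permanently running GPU node."""
    always_on = GPU_RATE * 24
    return 100 * (1 - daily_gpu_cost(active_hours) / always_on)

# Example: 2 active hours/day -> $0.30/day vs $3.60/day always-on.
print(round(daily_gpu_cost(2), 2))   # 0.3
print(round(gpu_savings_pct(2), 1))  # 91.7
```

At roughly 9.6 active hours per day the saving falls to 60%, which matches the 60-90% range quoted above for intermittent loads.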


Section 06

Deployment Guide: From Local Testing to Production Practice

Local Testing (k3d)

  1. Start vLLM container;
  2. Create k3d cluster;
  3. Install KEDA;
  4. Deploy resources;
  5. Load testing (locust).
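Step 5 uses locust in the project; as a dependency-free illustration of the same idea, here is a minimal concurrent load driver built on the standard library — `send` stands in for whatever callable issues one request (an HTTP POST against the gateway in practice):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_load(send: Callable[[int], None], n_requests: int, concurrency: int) -> list[float]:
    """Fire n_requests through `send` with bounded concurrency; return latencies in seconds."""
    def timed(i: int) -> float:
        start = time.perf_counter()
        send(i)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(n_requests)))

# Example with a stub in place of a real HTTP call:
latencies = run_load(lambda i: time.sleep(0.01), n_requests=20, concurrency=5)
print(len(latencies))  # 20
```

Firing more requests than `listLength` tasks per replica is what makes KEDA scale past one Pod, which is the behavior the load test is meant to exercise.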

GCP Production Deployment

  1. Run deployment script to create GKE cluster and GPU node pool;
  2. Trigger scaling (6+ requests);
  3. Monitor node/Pod status;
  4. Destroy resources after completion.

(Note: For specific commands, refer to the original project script.)


Section 07

Key Takeaways: Best Practices for Cloud-Native AI Infrastructure

Best practices summarized from the project:

  1. Two-layer elasticity (Pod + node level) is the key to zero cost;
  2. Queue buffering solves the problem of traffic absorption during cold start;
  3. Multi-layer optimization (image caching, model persistence) controls cold start time;
  4. vLLM continuous batching improves GPU throughput;
  5. Complete observability is a necessary condition for production deployment.

This architecture provides a reliable LLM inference solution for teams with limited budgets.