Zing Forum

Reading

Self-hosted LLM Platform Based on K3s: GPU Inference, Multi-Model Switching, and Cloud-Native Agent Toolchain

A complete proof-of-concept (POC) project demonstrating how to build a production-grade LLM inference platform on a single-node K3s cluster, supporting vLLM backend, LiteLLM gateway, dynamic multi-model switching, and a full observability system.

LLM平台K3svLLMLiteLLMGPU推理云原生Kubernetes多模型切换自托管AI
Published 2026-06-15 22:13Recent activity 2026-06-15 22:22Estimated read 11 min
Self-hosted LLM Platform Based on K3s: GPU Inference, Multi-Model Switching, and Cloud-Native Agent Toolchain
1

Section 01

Core Overview of the K3s-Based Self-Hosted LLM Platform

The K3s-based self-hosted LLM platform is a proof-of-concept (POC) project maintained by bitnik, released on June 15, 2026 (GitHub link: https://github.com/bitnik/llm-platform). This project demonstrates how to build a production-grade LLM inference platform on a single-node K3s cluster, with core features including:

  • Using vLLM as the inference backend
  • Implementing unified API access via LiteLLM gateway
  • Supporting dynamic multi-model switching
  • Built-in full observability system (Prometheus+Grafana+OTel)

This thread will analyze the platform's background, architecture, key mechanisms, deployment process, and technology selection across different floors.

2

Section 02

Project Background and Objectives

With the penetration of LLMs in development workflows, more and more teams are exploring private infrastructure deployment solutions. Self-hosted LLMs offer advantages such as data privacy, cost control, and model selection flexibility, but also face challenges like complex architecture, high operation and maintenance thresholds, and difficult resource management.

This project aims to provide a complete production-grade LLM inference platform POC on a single-node K3s, not only verifying model operation capabilities but also covering key production deployment links such as GPU scheduling, Operator management, and Kubernetes manifest orchestration.

3

Section 03

Layered Analysis of Platform Architecture

The platform adopts a layered and decoupled cloud-native architecture with clear responsibilities for each component:

External Client Layer

Developers can interact with the platform via Claude Code (HTTPS access), kubectl-ai (K8s command-line assistant), and k8sgpt CLI (K8s diagnostic tool).

Gateway Layer

LiteLLM Proxy serves as the unified API gateway, responsible for:

  • Routing requests by model name
  • User-level API key management, budget control, and rate limiting
  • Unified logging and monitoring
  • Automatic conversion between Anthropic and OpenAI protocols

Inference Layer

vLLM is the core inference engine, deployed on a single physical GPU. The platform supports multi-model deployment, but only one model is active at any time (ACTIVE), while others are dormant.

Storage and Observability Layer

  • local-path PVC: Persist model weights using NVMe local storage
  • Prometheus + Grafana + OTel: Full monitoring, alerting, and traceability system
4

Section 04

Detailed Explanation of Dynamic Multi-Model Switching Mechanism

Dynamic multi-model switching is a featured design of the platform, using a 'sleep-wake' strategy:

State Definitions

State Description VRAM Usage
ACTIVE Model loaded into GPU VRAM, can respond immediately ~20GB VRAM
SLEEPING (L1) Weights offloaded from VRAM to system memory (mapping retained) 0 VRAM
COLD Weights only stored on disk, need reloading 0 VRAM, 0 RAM

Switch Controller

The built-in switch controller manages state transitions via POST /sleep and POST /wake_up endpoints:

  • When switching, the currently active model enters L1 sleep (VRAM → memory)
  • The target model is loaded from memory/disk to VRAM to become the new active model

Value

Applicable scenarios:

  • Mixed use of code assistants and chatbots
  • Multi-tenant environments (switch on demand instead of reserving GPUs)
  • Cost-sensitive scenarios (maximize single GPU utilization)
5

Section 05

Deployment Process and Observability System

Key Steps of Deployment Process

  1. Base Environment Preparation: Choose Ubuntu 24.04 LTS (excellent NVIDIA driver/CUDA support), install NVIDIA driver and Container Toolkit (bridge for containers to access GPUs).
  2. K3s Cluster Setup: Single-node K3s (lightweight, built-in Traefik and local-path), deploy NVIDIA GPU Operator (convert GPUs into resources accessible by pods).
  3. Model Service Deployment: vLLM configuration needs to request nvidia.com/gpu:1 resource, adjust VRAM utilization parameters, enable sleep mode; persist model weights via local-path PVC (avoid repeated downloads).
  4. Gateway and Entry Configuration: LiteLLM as the only forward entry; Traefik Ingress + cert-manager for HTTPS secure access.
  5. Client Access: Different tools have different configuration methods (e.g., kagent sets baseUrl to point to LiteLLM, Claude Code sets the ANTHROPIC_BASE_URL environment variable).

Observability System

Core metrics include GPU utilization (exported by DCGM), KV cache pressure (unique to vLLM), preemption rate, and P95 latency; optional integration with OpenTelemetry for full-link tracing.

6

Section 06

Technology Selection Considerations

Why Choose vLLM?

  • PagedAttention algorithm: Improves VRAM utilization and throughput
  • Continuous batching: Supports multi-user concurrent pipeline processing
  • OpenAI-compatible API: Reduces client migration costs
  • Active community: Fast iteration and new model support

Why Choose LiteLLM?

  • Multi-backend unification: Connects to vLLM, OpenAI, Azure OpenAI, etc.
  • Budget control: Fine-grained usage limits and cost management
  • Protocol conversion: Automatically handles API differences between vendors

Why Choose K3s?

  • Low resource usage: Suitable for edge/POC environments
  • Complete built-in components: Includes storage, Ingress, and DNS by default
  • Standard K8s compatibility: POC mode can be migrated to production clusters
7

Section 07

Current Limitations and Expansion Directions

Current Limitations

  • Single-node architecture: No high availability capability
  • Manual model switching: Requires explicit API calls, no auto load-driven switching
  • Storage limitation: local-path does not support cross-node migration

Expansion Directions

  • Multi-node expansion: Support multi-GPU distributed inference
  • Auto scaling: Scale vLLM replicas based on request queue length
  • Model cache optimization: Introduce shared storage or model repositories to accelerate cold starts
  • Multi-tenant isolation: Achieve stronger isolation via Namespace and NetworkPolicy
8

Section 08

Summary and Insights

This project provides a solid starting point for self-hosted LLM infrastructure teams. Its value lies not only in the code but also in the design ideas:

  1. Cloud-native first: Leverage K8s orchestration capabilities to avoid self-built scheduling
  2. Layered decoupling: Gateway, inference, and storage have clear roles, facilitating upgrades and replacements
  3. Resource efficiency: Sleep mechanism maximizes single GPU utilization
  4. Observability built-in: Monitoring as a first-class citizen

For teams evaluating self-hosted LLM solutions, this project offers a runnable reference implementation to help understand the full path from bare metal to service and technical trade-offs.