# Self-hosted LLM Platform Based on K3s: GPU Inference, Multi-Model Switching, and Cloud-Native Agent Toolchain

> A complete proof-of-concept (POC) project demonstrating how to build a production-grade LLM inference platform on a single-node K3s cluster, supporting vLLM backend, LiteLLM gateway, dynamic multi-model switching, and a full observability system.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-15T14:13:20.000Z
- 最近活动: 2026-06-15T14:22:32.931Z
- 热度: 161.8
- 关键词: LLM平台, K3s, vLLM, LiteLLM, GPU推理, 云原生, Kubernetes, 多模型切换, 自托管AI
- 页面链接: https://www.zingnex.cn/en/forum/thread/k3sllm-gpuagent
- Canonical: https://www.zingnex.cn/forum/thread/k3sllm-gpuagent
- Markdown 来源: floors_fallback

---

## Core Overview of the K3s-Based Self-Hosted LLM Platform

The K3s-based self-hosted LLM platform is a proof-of-concept (POC) project maintained by bitnik, released on June 15, 2026 (GitHub link: https://github.com/bitnik/llm-platform). This project demonstrates how to build a production-grade LLM inference platform on a single-node K3s cluster, with core features including:
- Using vLLM as the inference backend
- Implementing unified API access via LiteLLM gateway
- Supporting dynamic multi-model switching
- Built-in full observability system (Prometheus+Grafana+OTel)

This thread will analyze the platform's background, architecture, key mechanisms, deployment process, and technology selection across different floors.

## Project Background and Objectives

With the penetration of LLMs in development workflows, more and more teams are exploring private infrastructure deployment solutions. Self-hosted LLMs offer advantages such as data privacy, cost control, and model selection flexibility, but also face challenges like complex architecture, high operation and maintenance thresholds, and difficult resource management.

This project aims to provide a complete production-grade LLM inference platform POC on a single-node K3s, not only verifying model operation capabilities but also covering key production deployment links such as GPU scheduling, Operator management, and Kubernetes manifest orchestration.

## Layered Analysis of Platform Architecture

The platform adopts a layered and decoupled cloud-native architecture with clear responsibilities for each component:

### External Client Layer
Developers can interact with the platform via Claude Code (HTTPS access), kubectl-ai (K8s command-line assistant), and k8sgpt CLI (K8s diagnostic tool).

### Gateway Layer
LiteLLM Proxy serves as the unified API gateway, responsible for:
- Routing requests by model name
- User-level API key management, budget control, and rate limiting
- Unified logging and monitoring
- Automatic conversion between Anthropic and OpenAI protocols

### Inference Layer
vLLM is the core inference engine, deployed on a single physical GPU. The platform supports multi-model deployment, but only one model is active at any time (ACTIVE), while others are dormant.

### Storage and Observability Layer
- local-path PVC: Persist model weights using NVMe local storage
- Prometheus + Grafana + OTel: Full monitoring, alerting, and traceability system

## Detailed Explanation of Dynamic Multi-Model Switching Mechanism

Dynamic multi-model switching is a featured design of the platform, using a 'sleep-wake' strategy:

#### State Definitions
| State | Description | VRAM Usage |
|-------|-------------|------------|
| ACTIVE | Model loaded into GPU VRAM, can respond immediately | ~20GB VRAM |
| SLEEPING (L1) | Weights offloaded from VRAM to system memory (mapping retained) | 0 VRAM |
| COLD | Weights only stored on disk, need reloading | 0 VRAM, 0 RAM |

#### Switch Controller
The built-in switch controller manages state transitions via `POST /sleep` and `POST /wake_up` endpoints:
- When switching, the currently active model enters L1 sleep (VRAM → memory)
- The target model is loaded from memory/disk to VRAM to become the new active model

#### Value
Applicable scenarios:
- Mixed use of code assistants and chatbots
- Multi-tenant environments (switch on demand instead of reserving GPUs)
- Cost-sensitive scenarios (maximize single GPU utilization)

## Deployment Process and Observability System

### Key Steps of Deployment Process
1. **Base Environment Preparation**: Choose Ubuntu 24.04 LTS (excellent NVIDIA driver/CUDA support), install NVIDIA driver and Container Toolkit (bridge for containers to access GPUs).
2. **K3s Cluster Setup**: Single-node K3s (lightweight, built-in Traefik and local-path), deploy NVIDIA GPU Operator (convert GPUs into resources accessible by pods).
3. **Model Service Deployment**: vLLM configuration needs to request `nvidia.com/gpu:1` resource, adjust VRAM utilization parameters, enable sleep mode; persist model weights via local-path PVC (avoid repeated downloads).
4. **Gateway and Entry Configuration**: LiteLLM as the only forward entry; Traefik Ingress + cert-manager for HTTPS secure access.
5. **Client Access**: Different tools have different configuration methods (e.g., kagent sets baseUrl to point to LiteLLM, Claude Code sets the ANTHROPIC_BASE_URL environment variable).

### Observability System
Core metrics include GPU utilization (exported by DCGM), KV cache pressure (unique to vLLM), preemption rate, and P95 latency; optional integration with OpenTelemetry for full-link tracing.

## Technology Selection Considerations

#### Why Choose vLLM?
- PagedAttention algorithm: Improves VRAM utilization and throughput
- Continuous batching: Supports multi-user concurrent pipeline processing
- OpenAI-compatible API: Reduces client migration costs
- Active community: Fast iteration and new model support

#### Why Choose LiteLLM?
- Multi-backend unification: Connects to vLLM, OpenAI, Azure OpenAI, etc.
- Budget control: Fine-grained usage limits and cost management
- Protocol conversion: Automatically handles API differences between vendors

#### Why Choose K3s?
- Low resource usage: Suitable for edge/POC environments
- Complete built-in components: Includes storage, Ingress, and DNS by default
- Standard K8s compatibility: POC mode can be migrated to production clusters

## Current Limitations and Expansion Directions

### Current Limitations
- Single-node architecture: No high availability capability
- Manual model switching: Requires explicit API calls, no auto load-driven switching
- Storage limitation: local-path does not support cross-node migration

### Expansion Directions
- Multi-node expansion: Support multi-GPU distributed inference
- Auto scaling: Scale vLLM replicas based on request queue length
- Model cache optimization: Introduce shared storage or model repositories to accelerate cold starts
- Multi-tenant isolation: Achieve stronger isolation via Namespace and NetworkPolicy

## Summary and Insights

This project provides a solid starting point for self-hosted LLM infrastructure teams. Its value lies not only in the code but also in the design ideas:
1. **Cloud-native first**: Leverage K8s orchestration capabilities to avoid self-built scheduling
2. **Layered decoupling**: Gateway, inference, and storage have clear roles, facilitating upgrades and replacements
3. **Resource efficiency**: Sleep mechanism maximizes single GPU utilization
4. **Observability built-in**: Monitoring as a first-class citizen

For teams evaluating self-hosted LLM solutions, this project offers a runnable reference implementation to help understand the full path from bare metal to service and technical trade-offs.
