Self-hosted LLM Platform Based on K3s: GPU Inference, Multi-Model Switching, and Cloud-Native Agent Toolchain

A complete proof-of-concept (POC) project demonstrating how to build a production-grade LLM inference platform on a single-node K3s cluster, with a vLLM backend, a LiteLLM gateway, dynamic multi-model switching, and a full observability stack.

Tags: LLM platform, K3s, vLLM, LiteLLM, GPU inference, cloud-native, Kubernetes, multi-model switching, self-hosted AI

Published 2026-06-15 22:13 | Recent activity 2026-06-15 22:22 | Estimated read 11 min

Section 01

Core Overview of the K3s-Based Self-Hosted LLM Platform

The K3s-based self-hosted LLM platform is a proof-of-concept (POC) project maintained by bitnik, released on June 15, 2026 (GitHub link: https://github.com/bitnik/llm-platform). This project demonstrates how to build a production-grade LLM inference platform on a single-node K3s cluster, with core features including:

Using vLLM as the inference backend
Implementing unified API access via LiteLLM gateway
Supporting dynamic multi-model switching
Built-in full observability system (Prometheus+Grafana+OTel)

The sections below cover the platform's background, architecture, key mechanisms, deployment process, and technology selection.

Section 02

Project Background and Objectives

As LLMs become embedded in development workflows, more teams are exploring deployment on private infrastructure. Self-hosting offers data privacy, cost control, and flexibility in model selection, but it also brings architectural complexity, a high operations and maintenance burden, and difficult resource management.

This project aims to provide a complete POC of a production-grade LLM inference platform on single-node K3s. It goes beyond verifying that a model runs and covers key production deployment concerns such as GPU scheduling, Operator management, and Kubernetes manifest orchestration.

Section 03

Layered Analysis of Platform Architecture

The platform adopts a layered and decoupled cloud-native architecture with clear responsibilities for each component:

External Client Layer

Developers can interact with the platform via Claude Code (HTTPS access), kubectl-ai (K8s command-line assistant), and k8sgpt CLI (K8s diagnostic tool).

Gateway Layer

LiteLLM Proxy serves as the unified API gateway, responsible for:

Routing requests by model name
User-level API key management, budget control, and rate limiting
Unified logging and monitoring
Automatic conversion between Anthropic and OpenAI protocols
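
As a rough illustration of routing by model name, the sketch below builds a LiteLLM-style model_list and resolves a requested model alias to a backend URL. The aliases, the model identifiers, the in-cluster Service URLs, and the resolve_backend helper are all hypothetical and not taken from the project. The hosted_vllm/ provider prefix is assumed from LiteLLM's naming convention for vLLM backends and should be checked against the LiteLLM documentation.

```python
import json

# Hypothetical aliases and in-cluster Service URLs (assumptions, not from the project).
MODEL_LIST = [
    {
        "model_name": "coder",
        "litellm_params": {
            "model": "hosted_vllm/example-org/example-coder",
            "api_base": "http://vllm-coder.llm.svc.cluster.local:8000/v1",
        },
    },
    {
        "model_name": "chat",
        "litellm_params": {
            "model": "hosted_vllm/example-org/example-chat",
            "api_base": "http://vllm-chat.llm.svc.cluster.local:8000/v1",
        },
    },
]


def resolve_backend(requested_model, model_list=MODEL_LIST):
    """Return the api_base of the first entry whose model_name matches."""
    for entry in model_list:
        if entry["model_name"] == requested_model:
            return entry["litellm_params"]["api_base"]
    raise KeyError(f"no backend configured for model {requested_model!r}")


if __name__ == "__main__":
    # Print the structure for inspection before copying it into a gateway config.
    print(json.dumps({"model_list": MODEL_LIST}, indent=2))
```

The real gateway performs this lookup internally; the helper only makes the mapping from the client-facing model name to a backend explicit.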

Inference Layer

vLLM is the core inference engine, deployed on a single physical GPU. The platform supports deploying multiple models, but only one model is ACTIVE at any time; the others remain dormant (SLEEPING or COLD).

Storage and Observability Layer

local-path PVC: Persists model weights on local NVMe storage
Prometheus + Grafana + OTel: Full monitoring, alerting, and tracing stack

Section 04

Detailed Explanation of Dynamic Multi-Model Switching Mechanism

Dynamic multi-model switching is the platform's signature design, based on a sleep/wake strategy:

State Definitions

State	Description	Memory Usage
ACTIVE	Model loaded into GPU VRAM, can respond immediately	~20GB VRAM
SLEEPING (L1)	Weights offloaded from VRAM to system memory (mapping retained)	0 VRAM
COLD	Weights only stored on disk, need reloading	0 VRAM, 0 RAM

Switch Controller

The built-in switch controller manages state transitions via POST /sleep and POST /wake_up endpoints:

When switching, the currently active model enters L1 sleep (weights move from VRAM to system memory)
The target model is then loaded from system memory or disk into VRAM and becomes the new active model
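
The switching sequence can be sketched as a small controller, shown here under several assumptions. The SwitchController class, the backend URLs, and the injectable post function are hypothetical. The level=1 query parameter on /sleep is assumed from vLLM's sleep-mode API. The sketch also assumes every model already has a running vLLM server process, so it does not cover bringing a COLD model up from disk.

```python
import urllib.request
from typing import Callable, Dict, Optional


def http_post(url: str) -> int:
    """POST with an empty body and return the HTTP status code."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status


class SwitchController:
    """Keeps at most one model ACTIVE; every other model is put to sleep."""

    def __init__(self, backends: Dict[str, str],
                 post: Callable[[str], int] = http_post):
        self.backends = backends  # model name -> base URL of its vLLM server
        self.post = post          # injectable so the logic can be tested offline
        self.active: Optional[str] = None

    def _call(self, model: str, path: str) -> None:
        status = self.post(self.backends[model] + path)
        if status != 200:
            raise RuntimeError(f"{model}{path} returned HTTP {status}")

    def switch_to(self, target: str) -> None:
        if target not in self.backends:
            raise KeyError(f"unknown model: {target}")
        if target == self.active:
            return  # already serving this model
        if self.active is not None:
            # Free VRAM first: offload the active model's weights to system RAM.
            self._call(self.active, "/sleep?level=1")
            self.active = None
        self._call(target, "/wake_up")
        self.active = target
```

Putting the old model to sleep before waking the new one means the GPU never has to hold two sets of weights at once, at the cost of a short window in which no model is active.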

Value

Applicable scenarios:

Mixed use of code assistants and chatbots
Multi-tenant environments (switch on demand instead of reserving GPUs)
Cost-sensitive scenarios (maximize single GPU utilization)

Section 05

Deployment Process and Observability System

Key Steps of Deployment Process

Base Environment Preparation: Use Ubuntu 24.04 LTS (well supported by NVIDIA drivers and CUDA), then install the NVIDIA driver and the NVIDIA Container Toolkit, which lets containers access the GPU.
K3s Cluster Setup: Install single-node K3s (lightweight, with Traefik and local-path built in), then deploy the NVIDIA GPU Operator, which exposes GPUs as resources that pods can request.
Model Service Deployment: The vLLM pod requests the nvidia.com/gpu: 1 resource, tunes the VRAM utilization parameter, and enables sleep mode; model weights persist on a local-path PVC to avoid repeated downloads.
Gateway and Entry Configuration: LiteLLM is the only externally exposed entry point; Traefik Ingress plus cert-manager provides HTTPS access.
Client Access: Each tool is configured differently (e.g., kagent sets baseUrl to point to LiteLLM, and Claude Code sets the ANTHROPIC_BASE_URL environment variable).
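
To make the model service step more concrete, the sketch below generates a Kubernetes Deployment for vLLM as a Python dictionary. It is an illustration under assumptions rather than the project's actual manifest: the Deployment name, model identifier, PVC name, image tag, and mount path are hypothetical, and setting VLLM_SERVER_DEV_MODE is assumed to be what exposes the sleep endpoints. How several models share the single GPU is outside the scope of this sketch.

```python
import json


def vllm_deployment(name, model, pvc_name, gpu_memory_utilization=0.90):
    """Build a Deployment dict for one vLLM server (illustrative values only)."""
    container = {
        "name": "vllm",
        "image": "vllm/vllm-openai:latest",  # assumed image tag
        "args": [
            "--model", model,
            "--gpu-memory-utilization", str(gpu_memory_utilization),
            "--enable-sleep-mode",
        ],
        # Assumption: dev mode is required for the /sleep and /wake_up endpoints.
        "env": [{"name": "VLLM_SERVER_DEV_MODE", "value": "1"}],
        "ports": [{"containerPort": 8000}],
        "resources": {"limits": {"nvidia.com/gpu": 1}},
        # Hypothetical cache path; weights persist on the PVC across restarts.
        "volumeMounts": [{"name": "models",
                          "mountPath": "/root/.cache/huggingface"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [container],
                    "volumes": [{
                        "name": "models",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = vllm_deployment("vllm-coder", "example-org/example-coder",
                               "model-cache")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped into kubectl apply -f - for a quick experiment.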

Observability System

Core metrics include GPU utilization (exported by DCGM), KV cache pressure (a vLLM-specific metric), preemption rate, and P95 latency. OpenTelemetry can optionally be integrated for end-to-end tracing.
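
P95 latency is usually derived from Prometheus histogram buckets rather than from raw samples. The sketch below shows the idea with a simplified version of the linear interpolation used by PromQL's histogram_quantile. The bucket boundaries and counts are made-up illustration data, not measurements from the platform.

```python
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound; the last bound may be float("inf").
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf") or count == prev_count:
                # Cannot interpolate inside an unbounded or empty bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up request latency buckets: (upper bound in seconds, cumulative count).
EXAMPLE_BUCKETS = [(0.5, 50), (1.0, 80), (2.0, 95), (float("inf"), 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, EXAMPLE_BUCKETS))
```

The estimate is only as precise as the bucket layout, so bucket boundaries should bracket the latency targets being alerted on.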

Section 06

Technology Selection Considerations

Why Choose vLLM?

PagedAttention algorithm: Improves VRAM utilization and throughput
Continuous batching: Admits new requests into a running batch, keeping the GPU busy under concurrent multi-user load
OpenAI-compatible API: Reduces client migration costs
Active community: Fast iteration and new model support

Why Choose LiteLLM?

Multi-backend unification: Connects to vLLM, OpenAI, Azure OpenAI, etc.
Budget control: Fine-grained usage limits and cost management
Protocol conversion: Automatically handles API differences between vendors

Why Choose K3s?

Low resource usage: Suitable for edge/POC environments
Complete built-in components: Includes storage, Ingress, and DNS by default
Standard K8s compatibility: Manifests developed for the POC can be migrated to production clusters

Section 07

Current Limitations and Expansion Directions

Current Limitations

Single-node architecture: No high availability capability
Manual model switching: Requires explicit API calls; there is no automatic load-driven switching
Storage limitation: local-path does not support cross-node migration

Expansion Directions

Multi-node expansion: Support multi-GPU distributed inference
Auto scaling: Scale vLLM replicas based on request queue length
Model cache optimization: Introduce shared storage or model repositories to accelerate cold starts
Multi-tenant isolation: Achieve stronger isolation via Namespace and NetworkPolicy
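
For the auto scaling direction, one possible rule is to size the replica count from the request queue length. The sketch below is purely illustrative: the function name, the per-replica queue target, and the replica bounds are hypothetical, and the project does not implement this today.

```python
import math


def desired_replicas(queue_length, target_queue_per_replica,
                     min_replicas=1, max_replicas=4):
    """Return a replica count that keeps each replica near its queue target."""
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(queue_length / target_queue_per_replica)
    # Clamp so the service never scales to zero or beyond the available GPUs.
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 8 queued requests per replica, a queue of 17 requests would call for 3 replicas; each additional replica would need its own GPU, which is why this only becomes useful after multi-node expansion.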

Section 08

Summary and Insights

This project provides a solid starting point for teams building self-hosted LLM infrastructure. Its value lies not only in the code but also in the design ideas:

Cloud-native first: Leverage K8s orchestration capabilities to avoid self-built scheduling
Layered decoupling: Gateway, inference, and storage have clear roles, facilitating upgrades and replacements
Resource efficiency: Sleep mechanism maximizes single GPU utilization
Observability built-in: Monitoring as a first-class citizen

For teams evaluating self-hosted LLM solutions, this project offers a runnable reference implementation that shows the full path from bare metal to a running service, along with the technical trade-offs involved.
