# KVWarden: Single-GPU Multi-Tenant Fair Scheduling, an LLM Inference Orchestration Layer Without Kubernetes

> A lightweight middleware that implements multi-tenant fair scheduling on top of vLLM/SGLang. It uses token-bucket rate limiting to ensure quiet users get predictable TTFT even under high load, without needing Kubernetes.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-04-22T03:36:34.000Z
- Last activity: 2026-04-22T04:46:39.871Z
- Hotness: 160.8
- Keywords: LLM, inference, multi-tenant, fairness, vLLM, SGLang, GPU, orchestration, rate-limiting
- Thread URL: https://www.zingnex.cn/en/forum/thread/kvwarden-gpu-kubernetesllm
- Canonical: https://www.zingnex.cn/forum/thread/kvwarden-gpu-kubernetesllm
- Markdown source: floors_fallback

---

## KVWarden: Lightweight Single-GPU Multi-Tenant Fair Scheduling Without Kubernetes

KVWarden is a lightweight orchestration layer (≈3500 lines of code) that sits on top of vLLM/SGLang. It addresses multi-tenant fairness in LLM inference via token-bucket rate limiting, supports single-GPU multi-model lifecycle management (a frequency+recency strategy), and provides an OpenAI-compatible HTTP API. It eliminates the need for Kubernetes, making it a good fit for small teams or edge deployments.

## Background: The Fairness Challenge in Multi-Tenant LLM Inference

When multiple users or apps share a GPU for LLM inference, fair resource allocation becomes critical. For example, with a noisy neighbor (32 RPS) and a quiet user (1 RPS) sharing Llama-3.1-8B and no fairness control, the quiet user's TTFT p99 jumps from 53.9 ms to 1585 ms (a 29x increase). This is the problem KVWarden solves.

## Core Capabilities of KVWarden

KVWarden adds three key features to vLLM/SGLang:
1. Tenant-level token-bucket rate limiting: controls request rate at the entry point, holding the quiet user's TTFT p99 at 61.5 ms (only 14% above the no-contention baseline).
2. Single-GPU multi-model lifecycle management: Uses frequency+recency for model switching and cache eviction (smarter than LRU).
3. OpenAI-compatible HTTP API: Existing apps can integrate without code changes.
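The token-bucket idea in feature 1 can be sketched in a few lines. This is a minimal illustration under assumed names and parameters (`TokenBucket`, per-tenant `rate`/`burst`), not KVWarden's actual implementation:

```python
import time

class TokenBucket:
    """Minimal per-tenant token bucket: refills `rate` tokens/sec, stores up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate            # refill rate (requests per second)
        self.burst = burst          # bucket capacity
        self.tokens = burst         # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                # caller queues or rejects the request

# One bucket per tenant, keyed by the X-Tenant-ID header value
buckets = {"noisy": TokenBucket(rate=4, burst=8), "quiet": TokenBucket(rate=1, burst=2)}
```

Because the bucket is checked before a request ever reaches the inference engine, a 32 RPS tenant burns its budget quickly and gets queued, while the 1 RPS tenant's tokens are always available.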

## Technical Architecture of KVWarden

Core components:
- WorkloadRouter: Request analysis, length-aware scheduling, OpenAI API, streaming support.
- AdmissionController: Concurrency limit, priority queue (lower number = higher priority), Prometheus metrics.
- TenantManager: Tenant budget management, token-bucket implementation, DRR priority scoring.
- CacheManager: Model KV cache lifecycle, snapshot on unloading, layered eviction.
Request flow: Client → WorkloadRouter → TenantManager (rate-limit check) → AdmissionController (queue/admit) → CacheManager (model/KV) → vLLM/SGLang. Requests carry an `X-Tenant-ID` header for tenant identification.
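The TenantManager's DRR (deficit round robin) scoring can be illustrated with a toy scheduler. The queue layout, per-request cost units, and quantum values below are assumptions for illustration, not KVWarden's internals:

```python
from collections import deque

def drr_schedule(queues: dict[str, deque], quantum: dict[str, int], rounds: int):
    """Deficit round robin over per-tenant request queues.

    Each round a tenant's deficit grows by its quantum; it may then dequeue
    requests whose cost fits within the accumulated deficit. A tenant with a
    long backlog cannot starve others, because its deficit grows no faster.
    """
    deficit = {t: 0 for t in queues}
    served = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                deficit[tenant] = 0     # idle tenants don't hoard credit
                continue
            deficit[tenant] += quantum[tenant]
            while q and q[0][1] <= deficit[tenant]:
                req_id, cost = q.popleft()
                deficit[tenant] -= cost
                served.append((tenant, req_id))
    return served
```

For example, with eight queued requests from `noisy` and one from `quiet` at equal quanta, one round serves one request from each tenant instead of draining the noisy backlog first.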

## Key Experimental Results

Experiments on A100-SXM4 80GB with Llama-3.1-8B (vLLM):
1. Fairness test: 32 RPS noisy + 1 RPS quiet. The token bucket reduces the quiet user's TTFT p99 from 1585 ms (29x) to 61.5 ms (1.14x).
2. Admission cap test: Global concurrency limits don't improve single-model performance (vLLM's batching is efficient).
3. Benchmark framework validation: End-to-end tests with real vLLM ensure no systematic bias.
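The headline ratios in result 1 can be sanity-checked from the raw latencies (53.9 ms no-contention baseline, 1585 ms uncontrolled, 61.5 ms with the token bucket):

```python
baseline = 53.9       # quiet user's TTFT p99 with no competition (ms)
uncontrolled = 1585   # under a 32 RPS noisy neighbor, no fairness control (ms)
with_bucket = 61.5    # same load, token-bucket rate limiting enabled (ms)

print(round(uncontrolled / baseline, 1))  # ≈ 29.4x slowdown
print(round(with_bucket / baseline, 2))   # ≈ 1.14x, i.e. ~14% overhead
```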

## Comparison with Existing Solutions

KVWarden fills a gap as a no-Kubernetes single-node multi-tenant fair scheduler:
| System | Needs K8s | Multi-model | Tenant Fairness | Scenario |
|--------|-----------|-------------|-----------------|----------|
| NVIDIA Dynamo | Yes | Yes | No | Data center |
| llm-d (CNCF) | Yes | Single pool | No | Cloud-native large scale |
| Mammoth | Yes | Yes | No | Multi-hardware |
| AIBrix | Yes | Yes | No | Enterprise |
| Ollama | No | LRU eviction | No | Local single node |
| vLLM/SGLang | No | Single model | No | Basic inference |
| KVWarden | No | Yes (freq+recency) | Yes (token-bucket+DRR) | Single node (1-4 GPUs) |
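The "freq+recency" eviction policy in the table can be contrasted with Ollama's plain LRU via a toy scoring function. The weighting below (log-damped frequency times exponential recency decay) is an assumption for illustration, not KVWarden's actual formula:

```python
import math

def keep_score(freq: int, last_used: float, now: float,
               half_life: float = 600.0) -> float:
    """Higher score = more worth keeping resident on the GPU.

    Combines how often a model was requested (log-damped frequency) with
    how recently (exponential decay). Pure LRU looks only at `last_used`;
    the frequency term keeps a popular model loaded even when another
    model was touched slightly more recently.
    """
    recency = math.exp(-(now - last_used) / half_life)
    return math.log1p(freq) * recency

def pick_victim(models: dict[str, tuple[int, float]], now: float) -> str:
    """Evict the model with the lowest combined score."""
    return min(models, key=lambda m: keep_score(*models[m], now=now))
```

With a popular model last used 60 s ago and a rarely used one touched 10 s ago, LRU would evict the popular model; the combined score evicts the rarely used one instead.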

## How to Use KVWarden

Quick start:

```shell
pip install kvwarden
kvwarden serve --config configs/quickstart_fairness.yaml

# Wait for the health check to pass
until curl -fs localhost:8000/health > /dev/null; do sleep 2; done

# Send requests as two tenants
curl localhost:8000/v1/completions -H "X-Tenant-ID: noisy" \
  -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'
curl localhost:8000/v1/completions -H "X-Tenant-ID: quiet" \
  -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'
```

Configuration is YAML: tenants with rate limits, models with engine and GPU memory settings. CLI tools: `version`, `doctor`, `man`, `bench`.
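A hypothetical config along the lines described above; the field names here are illustrative guesses, not KVWarden's actual schema:

```yaml
# configs/quickstart_fairness.yaml — illustrative only, field names assumed
tenants:
  noisy:
    rate: 8        # requests per second refilled into the bucket
    burst: 16      # bucket capacity
  quiet:
    rate: 2
    burst: 4
models:
  llama31-8b:
    engine: vllm
    gpu_memory_utilization: 0.85
```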

## Limitations and Future Roadmap

Limitations: not a vLLM/SGLang replacement; not intended for K8s-scale deployments; does not magically improve single-tenant TTFT.
Use cases: single node (1-4 GPUs), no Kubernetes, multi-tenant fairness needed, edge/local deployments.
Future plans:
- Short-term: 8-tenant test, Llama-3.1-70B on 4×A100, Mixtral MoE fairness.
- v0.2.x: multi-engine routing (vLLM ↔ SGLang).
- v0.3: KV cache layering (LMCache), 32K-context fairness.
