Zing Forum


KVWarden: Single-GPU Multi-Tenant Fair Scheduling, an LLM Inference Orchestration Layer Without Kubernetes

A lightweight middleware that implements multi-tenant fair scheduling on top of vLLM/SGLang. It uses token-bucket rate limiting to ensure quiet users get predictable TTFT even under high load, without needing Kubernetes.

Tags: LLM inference, multi-tenant, fairness, vLLM, SGLang, GPU, orchestration, rate-limiting
Published 2026-04-22 11:36 · Recent activity 2026-04-22 12:46 · Estimated read: 6 min

Section 01

KVWarden: Lightweight Single-GPU Multi-Tenant Fair Scheduling Without Kubernetes

KVWarden is a lightweight orchestration layer (≈3,500 lines of code) that runs on top of vLLM/SGLang. It addresses multi-tenant fairness in LLM inference via token-bucket rate limiting, supports single-GPU multi-model lifecycle management (a frequency+recency strategy), and exposes an OpenAI-compatible HTTP API. It eliminates the need for Kubernetes, making it well suited to small teams and edge deployments.


Section 02

Background: The Fairness Challenge in Multi-Tenant LLM Inference

When multiple users or applications share one GPU for LLM inference, fair resource allocation becomes critical. For example, put a noisy neighbor (32 RPS) and a quiet user (1 RPS) on the same Llama-3.1-8B instance: without fairness control, the quiet user's TTFT p99 jumps from 53.9 ms to 1585 ms (≈29x). This is the problem KVWarden solves.


Section 03

Core Capabilities of KVWarden

KVWarden adds three key features to vLLM/SGLang:

  1. Tenant-level token-bucket rate limiting: Controls request rate at the entry point, bringing the quiet user's TTFT p99 down to 61.5 ms (only 14% above the no-contention baseline).
  2. Single-GPU multi-model lifecycle management: Uses frequency+recency for model switching and cache eviction (smarter than LRU).
  3. OpenAI-compatible HTTP API: Existing apps can integrate without code changes.
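The token-bucket mechanism from point 1 is a standard rate-limiting algorithm: each tenant's bucket refills at a fixed rate up to a burst capacity, and a request is admitted only if it can pay its token cost. The sketch below is a generic illustration of that algorithm, not KVWarden's actual implementation.

```python
import time


class TokenBucket:
    """Per-tenant token bucket: refills at `rate` tokens/sec, capped at `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # sustained tokens per second
        self.burst = burst        # maximum bucket capacity
        self.tokens = burst       # start full
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Admit a request costing `cost` tokens, or reject without blocking."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, without exceeding capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A gateway would keep one bucket per `X-Tenant-ID` and reject (or queue) requests whose `try_acquire` fails, which is what caps the noisy tenant's impact on everyone else.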

Section 04

Technical Architecture of KVWarden

Core components:

  • WorkloadRouter: Request analysis, length-aware scheduling, OpenAI API, streaming support.
  • AdmissionController: Concurrency limit, priority queue (lower number = higher priority), Prometheus metrics.
  • TenantManager: Tenant budget management, token-bucket implementation, DRR priority scoring.
  • CacheManager: Model KV cache lifecycle, snapshot on unloading, layered eviction.

Request flow: Client → WorkloadRouter → TenantManager (rate-limit check) → AdmissionController (queue/admit) → CacheManager (model/KV) → vLLM/SGLang. Requests carry an X-Tenant-ID header for tenant identification.
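The TenantManager is described as using DRR (deficit round robin) for priority scoring. In classic DRR, each tenant's queue earns a fixed quantum of credit per round and may dequeue requests until its accumulated deficit runs out, which bounds how far any one tenant can get ahead. Here is a minimal, hypothetical sketch of that idea; class and method names are illustrative, not KVWarden's actual code.

```python
from collections import deque


class DRRScheduler:
    """Deficit round robin over per-tenant FIFO queues (illustrative sketch)."""

    def __init__(self, quantum: int = 100):
        self.quantum = quantum                    # credit added per tenant per round
        self.queues: dict[str, deque] = {}        # tenant -> pending (request, cost)
        self.deficit: dict[str, int] = {}         # tenant -> accumulated credit

    def enqueue(self, tenant: str, request, cost: int) -> None:
        self.queues.setdefault(tenant, deque()).append((request, cost))
        self.deficit.setdefault(tenant, 0)

    def next_batch(self):
        """One DRR round: each backlogged tenant spends quantum + leftover deficit."""
        served = []
        for tenant, q in self.queues.items():
            if not q:
                continue
            self.deficit[tenant] += self.quantum
            # Serve head-of-line requests while the tenant can afford them.
            while q and q[0][1] <= self.deficit[tenant]:
                request, cost = q.popleft()
                self.deficit[tenant] -= cost
                served.append((tenant, request))
            if not q:
                self.deficit[tenant] = 0  # idle tenants don't bank credit
        return served
```

With a cost proportional to, say, prompt length, this gives each tenant roughly equal service per round regardless of how many requests it floods in.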

Section 05

Key Experimental Results

Experiments on A100-SXM4 80GB with Llama-3.1-8B (vLLM):

  1. Fairness test: 32 RPS noisy + 1 RPS quiet. Token-bucket rate limiting reduces the quiet user's TTFT p99 from 1585 ms (29x baseline) to 61.5 ms (1.14x baseline).
  2. Admission cap test: Global concurrency limits don't improve single-model performance (vLLM's batching is efficient).
  3. Benchmark framework validation: End-to-end tests with real vLLM ensure no systematic bias.
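The headline metric throughout is TTFT p99, i.e. the 99th percentile of time-to-first-token samples collected per tenant. As a reference for how such a number is computed from raw latency samples (this is a generic helper, not KVWarden's bench code):

```python
import statistics


def p99(samples_ms: list[float]) -> float:
    """99th-percentile latency from a list of TTFT samples in milliseconds."""
    if len(samples_ms) < 2:
        return samples_ms[0]
    # quantiles(n=100) returns the 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[98]
```

Note that p99 over a 1 RPS tenant needs a long run (hundreds of samples) to be stable, which is presumably why the fairness test pairs it with a sustained noisy load.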

Section 06

Comparison with Existing Solutions

KVWarden fills a gap as a no-Kubernetes single-node multi-tenant fair scheduler:

| System | Needs K8s | Multi-model | Tenant fairness | Scenario |
| --- | --- | --- | --- | --- |
| NVIDIA Dynamo | Yes | Yes | No | Data center |
| llm-d (CNCF) | Yes | Single pool | No | Cloud-native, large scale |
| Mammoth | Yes | Yes | No | Multi-hardware |
| AIBrix | Yes | Yes | No | Enterprise |
| Ollama | No | LRU eviction | No | Local single node |
| vLLM/SGLang | No | Single model | No | Basic inference |
| KVWarden | No | Yes (freq+recency) | Yes (token-bucket + DRR) | Single node (1-4 GPUs) |

Section 07

How to Use KVWarden

Quick start:

```shell
pip install kvwarden
kvwarden serve --config configs/quickstart_fairness.yaml
```

Wait for the health check to pass:

```shell
until curl -fs localhost:8000/health > /dev/null; do sleep 2; done
```

Send requests as different tenants (identified by the X-Tenant-ID header):

```shell
# Noisy tenant
curl localhost:8000/v1/completions -H "X-Tenant-ID: noisy" \
  -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'

# Quiet tenant
curl localhost:8000/v1/completions -H "X-Tenant-ID: quiet" \
  -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'
```

The YAML config defines tenants with rate limits and models with engine and GPU-memory settings. CLI tools: version, doctor, man, bench.
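The article says the config declares tenants with rate limits and models with engine/GPU-memory settings. A hypothetical sketch of what such a file could look like follows; all field names here are illustrative assumptions, not KVWarden's actual schema:

```yaml
# Illustrative sketch only -- field names are assumptions, not the real schema.
tenants:
  noisy:
    rate_rps: 32      # sustained token-bucket refill rate
    burst: 64         # bucket capacity
  quiet:
    rate_rps: 1
    burst: 4

models:
  llama31-8b:
    engine: vllm
    gpu_memory_fraction: 0.9
```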


Section 08

Limitations and Future Roadmap

Limitations:

  • Not a replacement for vLLM/SGLang as an inference engine.
  • Not intended for K8s-scale, multi-node deployments.
  • No magic improvement to single-tenant TTFT.

Use cases: a single node (1-4 GPUs), no Kubernetes, multi-tenant fairness requirements, edge/local deployments.

Future plans:

  • Short term: 8-tenant test, Llama-3.1-70B on 4×A100, Mixtral MoE fairness.
  • v0.2.x: Multi-engine routing (vLLM ↔ SGLang).
  • v0.3: KV cache layering (LMCache), 32K-context fairness.