Zing Forum


KVWarden: Fair Multi-Tenant Scheduling on a Single GPU, an LLM Inference Orchestration Layer Without Kubernetes

A lightweight middleware layer on top of vLLM/SGLang that enforces per-tenant fairness through token-bucket rate limiting, so that quiet users still get predictable TTFT under heavy load, with no Kubernetes required.

Tags: LLM inference, multi-tenant, fairness, vLLM, SGLang, GPU, orchestration, rate-limiting
Published 2026-04-22 11:36 · Last activity 2026-04-22 12:46 · Estimated reading time: 6 minutes
Section 01

KVWarden: Lightweight Single-GPU Multi-Tenant Fair Scheduling Without Kubernetes

KVWarden is a lightweight orchestration layer (≈3,500 lines of code) that runs on top of vLLM/SGLang. It addresses multi-tenant fairness in LLM inference through token-bucket rate limiting, manages multi-model lifecycles on a single GPU (using a frequency+recency strategy), and exposes an OpenAI-compatible HTTP API. It removes the need for Kubernetes, making it a good fit for small teams and edge deployments.

Section 02

Background: The Fairness Challenge in Multi-Tenant LLM Inference

When multiple users or applications share one GPU for LLM inference, fair resource allocation becomes critical. Consider a noisy neighbor sending 32 RPS alongside a quiet user sending 1 RPS against Llama-3.1-8B: without any fairness control, the quiet user's TTFT p99 jumps from 53.9 ms to 1585 ms, a 29x increase. This is the problem KVWarden solves.

Section 03

Core Capabilities of KVWarden

KVWarden adds three key features to vLLM/SGLang:

  1. Tenant-level token-bucket rate limiting: controls request rate at the entry point, reducing the quiet user's TTFT p99 to 61.5 ms (only 14% above the no-competition baseline).
  2. Single-GPU multi-model lifecycle management: uses a combined frequency+recency score for model switching and cache eviction (smarter than plain LRU).
  3. OpenAI-compatible HTTP API: existing applications integrate without code changes.
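
The per-tenant token bucket behind feature 1 can be sketched roughly as follows. This is a minimal illustration of the general technique, not KVWarden's actual implementation; the class name and the `rate`/`burst` parameters are assumptions.

```python
import time

class TokenBucket:
    """Minimal per-tenant token bucket: admit a request only if a token is available."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate            # tokens refilled per second (sustained RPS)
        self.burst = burst          # bucket capacity (maximum burst size)
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the burst capacity
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                # caller should queue or reject (e.g. HTTP 429)

# one bucket per tenant, keyed by the X-Tenant-ID header (example limits are made up)
buckets = {"noisy": TokenBucket(rate=8, burst=16), "quiet": TokenBucket(rate=2, burst=4)}
```

Because each tenant refills at its own rate, a noisy tenant can only drain its own bucket; the quiet tenant's budget is untouched, which is what keeps its TTFT predictable.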

Section 04

Technical Architecture of KVWarden

Core components:

  • WorkloadRouter: request analysis, length-aware scheduling, OpenAI-compatible API, streaming support.
  • AdmissionController: concurrency limits, priority queue (lower number = higher priority), Prometheus metrics.
  • TenantManager: per-tenant budget management, token-bucket implementation, DRR priority scoring.
  • CacheManager: model KV-cache lifecycle, snapshot on unload, layered eviction.

Request flow: Client → WorkloadRouter → TenantManager (rate-limit check) → AdmissionController (queue/admit) → CacheManager (model/KV) → vLLM/SGLang. Requests identify their tenant via the X-Tenant-ID header.
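
The frequency+recency eviction mentioned for CacheManager can be illustrated with a simple combined score. This is a sketch under assumed weights (`w_freq`, `w_recency` and both function names are hypothetical); KVWarden's real scoring may differ.

```python
import time

def eviction_score(hits: int, last_used: float, now: float,
                   w_freq: float = 1.0, w_recency: float = 0.5) -> float:
    """Higher score = more worth keeping. Pure LRU would look at recency only;
    adding a frequency term protects a model that is hit often but not most recently."""
    age = now - last_used                      # seconds since the last request
    return w_freq * hits - w_recency * age

def pick_victim(models: dict) -> str:
    """Evict the loaded model with the lowest combined score."""
    now = time.monotonic()
    return min(models, key=lambda m: eviction_score(models[m]["hits"],
                                                    models[m]["last_used"], now))
```

With these weights, a model with 10 hits that was last used 10 s ago (score 5.0) outranks a model with 1 hit used 1 s ago (score 0.5), which is exactly the case where plain LRU would pick the wrong victim.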

Section 05

Key Experimental Results

Experiments on an A100-SXM4 80GB with Llama-3.1-8B (vLLM backend):

  1. Fairness test: 32 RPS noisy tenant + 1 RPS quiet tenant. Token-bucket rate limiting reduces the quiet user's TTFT p99 from 1585 ms (29x baseline) to 61.5 ms (1.14x baseline).
  2. Admission-cap test: a global concurrency limit alone does not improve single-model performance (vLLM's batching is already efficient).
  3. Benchmark-framework validation: end-to-end tests against real vLLM confirm there is no systematic bias.
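
The headline ratios in result 1 can be double-checked from the raw measurements (the 53.9 ms baseline comes from the background section):

```python
baseline_ms = 53.9     # quiet tenant TTFT p99, no competing load
contended_ms = 1585.0  # with a 32 RPS noisy neighbor, no fairness control
limited_ms = 61.5      # with per-tenant token-bucket limiting enabled

print(round(contended_ms / baseline_ms, 1))  # ≈ 29.4x degradation without control
print(round(limited_ms / baseline_ms, 2))    # ≈ 1.14x, i.e. ~14% overhead
```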

Section 06

Comparison with Existing Solutions

KVWarden fills a gap as a no-Kubernetes single-node multi-tenant fair scheduler:

| System | Needs K8s | Multi-model | Tenant fairness | Target scenario |
|---|---|---|---|---|
| NVIDIA Dynamo | Yes | Yes | No | Data center |
| llm-d (CNCF) | Yes | Single pool | No | Cloud-native, large scale |
| Mammoth | Yes | Yes | No | Multi-hardware |
| AIBrix | Yes | Yes | No | Enterprise |
| Ollama | No | LRU eviction | No | Local single node |
| vLLM/SGLang | No | Single model | No | Basic inference |
| KVWarden | No | Yes (freq+recency) | Yes (token-bucket + DRR) | Single node (1-4 GPUs) |

Section 07

How to Use KVWarden

Quick start:

```shell
pip install kvwarden
kvwarden serve --config configs/quickstart_fairness.yaml
```

Wait for the health check to pass:

```shell
until curl -fs localhost:8000/health > /dev/null; do sleep 2; done
```

Send requests as different tenants:

```shell
# noisy tenant
curl localhost:8000/v1/completions -H "X-Tenant-ID: noisy" \
  -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'

# quiet tenant
curl localhost:8000/v1/completions -H "X-Tenant-ID: quiet" \
  -d '{"model":"llama31-8b","prompt":"...","max_tokens":64}'
```

The YAML config defines tenants with rate limits and models with engine and GPU-memory settings. CLI tools: version, doctor, man, bench.
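The YAML config mentioned above might look roughly like this. This is a hypothetical sketch: the key names (`tenants`, `rate_limit_rps`, `burst`, `engine`, `gpu_memory_fraction`) are assumptions, not KVWarden's documented schema.

```yaml
# hypothetical sketch of configs/quickstart_fairness.yaml
tenants:
  noisy:
    rate_limit_rps: 8      # sustained token-bucket refill rate
    burst: 16              # bucket capacity
  quiet:
    rate_limit_rps: 2
    burst: 4

models:
  llama31-8b:
    engine: vllm
    gpu_memory_fraction: 0.9
```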

Section 08

Limitations and Future Roadmap

Limitations: not a vLLM/SGLang replacement; not intended for Kubernetes-scale deployments; no magic improvement to single-tenant TTFT. Target scenarios: single node (1-4 GPUs), no Kubernetes, multi-tenant fairness requirements, edge/local deployments. Roadmap: near term: 8-tenant tests, Llama-3.1-70B on 4×A100, Mixtral MoE fairness. v0.2.x: multi-engine routing (vLLM ↔ SGLang). v0.3: tiered KV cache (LMCache), 32K-context fairness.