Section 01
KVWarden: Lightweight Single-GPU Multi-Tenant Fair Scheduling Without Kubernetes
KVWarden is a lightweight orchestration layer (≈3500 lines of code) that runs on top of vLLM/SGLang. It addresses multi-tenant fairness in LLM inference with token-bucket rate limiting, manages the lifecycle of multiple models on a single GPU using a frequency+recency strategy, and exposes an OpenAI-compatible HTTP API. Because it does not require Kubernetes, it is well suited to small teams and edge deployments.
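
To make the fairness mechanism concrete, here is a minimal sketch of per-tenant token-bucket rate limiting. It is not KVWarden's actual code: the class names `TokenBucket` and `TenantLimiter`, the `admit` method, and the `capacity` and `refill_rate` parameters are all hypothetical, and the cost of a request is assumed to be its token count. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """One tenant's budget. Names and parameters are illustrative assumptions."""

    def __init__(self, capacity: float, refill_rate: float, now: float) -> None:
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # budget units restored per second
        self.tokens = capacity          # a new bucket starts full
        self.last_refill = now

    def try_consume(self, cost: float, now: float) -> bool:
        # Restore budget for the elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False  # caller may queue or reject the request


class TenantLimiter:
    """Keeps one independent bucket per tenant (hypothetical helper)."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.buckets = {}  # tenant id -> TokenBucket

    def admit(self, tenant: str, cost: float, now: float) -> bool:
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_rate, now)
            self.buckets[tenant] = bucket
        return bucket.try_consume(cost, now)


limiter = TenantLimiter(capacity=10, refill_rate=1)
first = limiter.admit("tenant-a", cost=8, now=0.0)   # fits in the full bucket
second = limiter.admit("tenant-a", cost=8, now=0.0)  # only 2 units remain
other = limiter.admit("tenant-b", cost=8, now=0.0)   # separate bucket, unaffected
later = limiter.admit("tenant-a", cost=8, now=6.0)   # 2 + 6 refilled = 8
```

Because each tenant draws from its own bucket, a tenant that bursts exhausts only its own budget, while other tenants keep being admitted. In production code the `now` argument would typically come from a monotonic clock such as `time.monotonic()`.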