System Architecture Design
The system architecture of mini-vllm follows a clear layered design:
Request Layer: Receives user input, exposing RESTful endpoints and WebSocket streaming output via FastAPI. The streaming API follows the ChatGPT-style interaction pattern, returning generated tokens one by one so users see output immediately rather than waiting for the full response.
Scheduling Layer: Maintains the request queue and implements dynamic batching logic. The scheduler determines the composition and execution timing of batches based on current system load, request priority, and latency constraints.
Cache Layer: Manages the lifecycle of the KV cache, including allocation, update, compression, and release. For very large models, an optional scheme that offloads the cache to SSD is also explored.
Inference Layer: Executes core Transformer computations, supporting multiple decoding strategies (greedy decoding, beam search, Top-K sampling, Top-P nucleus sampling).
Model Layer: Responsible for model loading and quantization conversion, integrating with the HuggingFace Transformers library to support multiple pre-trained models.
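The layered flow above can be sketched as a few cooperating classes. This is a minimal illustration, not mini-vllm's actual API: all class and method names (Scheduler, KVCache, Engine, and so on) are hypothetical, and the inference layer is stubbed out rather than running a real model.

```python
import queue
from dataclasses import dataclass


@dataclass
class Request:
    """A user request entering through the request layer."""
    prompt: str
    max_tokens: int = 16


class Scheduler:
    """Scheduling layer: queues requests and forms dynamic batches."""

    def __init__(self, max_batch_size: int = 8):
        self.queue: "queue.Queue[Request]" = queue.Queue()
        self.max_batch_size = max_batch_size

    def submit(self, req: Request) -> None:
        self.queue.put(req)

    def next_batch(self) -> list:
        # Drain up to max_batch_size pending requests into one batch.
        batch = []
        while not self.queue.empty() and len(batch) < self.max_batch_size:
            batch.append(self.queue.get())
        return batch


class KVCache:
    """Cache layer: tracks per-request KV blocks (allocation/release only here)."""

    def __init__(self):
        self.blocks = {}

    def allocate(self, req_id: int) -> None:
        self.blocks[req_id] = []

    def release(self, req_id: int) -> None:
        self.blocks.pop(req_id, None)


class Engine:
    """Inference layer stub: echoes prompts instead of running a Transformer."""

    def step(self, batch: list) -> list:
        return [f"<generated for: {r.prompt}>" for r in batch]
```

In a real system the scheduler would also weigh priority and latency constraints when composing a batch, and the engine would consult the KV cache on every decoding step; the sketch only shows how responsibilities divide across layers.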
Diversity of Decoding Strategies
Different application scenarios require different text generation strategies. mini-vllm implements four main decoding methods:
Greedy Decoding: Selects the token with the highest probability at each step, suitable for deterministic tasks such as code completion.
Beam Search: Maintains multiple candidate sequences in parallel and finally selects the complete sequence with the highest cumulative probability, suitable for translation tasks that favor global optimality.
Top-K Sampling: Randomly selects from the K tokens with the highest probabilities, balancing diversity and quality.
Top-P (Nucleus Sampling): Samples from the smallest set of tokens whose cumulative probability reaches P. Compared to Top-K, it adapts better to contexts where the distribution is sometimes sharply peaked and sometimes flat, since the candidate set grows and shrinks with the distribution itself.
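Three of the four strategies can be sketched in a few lines over a raw logits vector (beam search additionally needs per-sequence bookkeeping, so it is omitted here). This is an illustrative sketch using NumPy, not mini-vllm's actual sampler code; function names and signatures are hypothetical.

```python
import numpy as np


def greedy(logits: np.ndarray) -> int:
    """Greedy decoding: always take the highest-probability token."""
    return int(np.argmax(logits))


def top_k_sample(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Top-K: renormalize over the k highest-logit tokens and sample."""
    idx = np.argsort(logits)[-k:]                 # k highest-logit tokens
    probs = np.exp(logits[idx] - logits[idx].max())  # stable softmax over the k
    probs /= probs.sum()
    return int(rng.choice(idx, p=probs))


def top_p_sample(logits: np.ndarray, p: float, rng: np.random.Generator) -> int:
    """Top-P (nucleus): sample from the smallest set whose mass reaches p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1     # smallest prefix reaching mass p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

Note the relationships the sketch makes visible: Top-K with k=1 degenerates to greedy decoding, and Top-P with a very small p does as well, because the nucleus collapses to the single most probable token.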
The implementation of these strategies shows how a unified framework can support different generation behaviors while keeping the code modular and extensible.