The engine adopts a highly modular design; its core components include:
Model Loader: Loads model weights from various formats (PyTorch, Safetensors, GGUF, etc.) and performs quantization conversion during loading. Supports a lazy-loading strategy, moving weights into GPU memory only when they are needed.
Execution Scheduler: Manages the inference request queue, scheduling requests by priority, resource requirements, and current system load. Implements multiple scheduling strategies, including first-come-first-served, shortest-job-first, and priority-based preemptive scheduling.
Kernel Optimization Layer: Provides optimized compute kernels for different hardware platforms (CUDA, ROCm, Metal, Vulkan). Uses tools such as Triton and CUTLASS to generate efficient GPU code that fully exploits the hardware.
Memory Manager: The core component for memory efficiency, implementing the dynamic allocation, paged caching, and memory-reuse strategies described above. Exposes detailed memory-usage statistics and diagnostic interfaces to ease performance tuning.
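The Model Loader's lazy-loading idea can be sketched in a few lines: a weight is read from storage and materialized only on first access, then cached. This is a minimal illustration, not the engine's actual loader API; the `LazyWeights` class and `load_fn` callback are hypothetical names.

```python
class LazyWeights:
    """Hypothetical sketch of lazy weight loading: each weight is
    fetched (and, in a real engine, dequantized / moved to the GPU)
    only the first time it is accessed, then kept in a cache."""

    def __init__(self, load_fn, names):
        self._load_fn = load_fn    # callable: weight name -> loaded tensor
        self._names = set(names)   # weights known from the checkpoint index
        self._cache = {}           # materialized weights, filled on demand

    def get(self, name):
        if name not in self._names:
            raise KeyError(f"unknown weight: {name}")
        if name not in self._cache:        # load on first use only
            self._cache[name] = self._load_fn(name)
        return self._cache[name]
```

Because formats like Safetensors and GGUF carry a tensor index, a real loader can resolve `names` from the file header without reading any tensor data up front.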
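The three scheduling strategies named above differ only in how the queue is ordered, which a single priority queue can demonstrate. The sketch below is a simplified assumption of how such a scheduler might look (it omits preemption and resource accounting); the class and parameter names are hypothetical, not the engine's API.

```python
import heapq
from itertools import count

class Scheduler:
    """Hypothetical sketch of the three strategies:
    'fcfs'     - first-come-first-served (order by arrival)
    'sjf'      - shortest job first (order by estimated output length)
    'priority' - priority-based (lower number = more urgent)
    """

    def __init__(self, strategy="fcfs"):
        self.strategy = strategy
        self._heap = []           # min-heap of (key, arrival, request)
        self._arrival = count()   # monotonic counter; also breaks ties

    def submit(self, request, priority=0, est_tokens=0):
        arrival = next(self._arrival)
        key = {
            "fcfs": arrival,        # earliest arrival first
            "sjf": est_tokens,      # smallest estimated job first
            "priority": priority,   # most urgent first
        }[self.strategy]
        heapq.heappush(self._heap, (key, arrival, request))

    def next_request(self):
        # Returns None when the queue is empty.
        return heapq.heappop(self._heap)[2] if self._heap else None
```

For example, under `"sjf"` a request with `est_tokens=20` is dispatched before one submitted earlier with `est_tokens=500`; a real preemptive scheduler would additionally pause a running request when a higher-priority one arrives.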
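The paged-cache and memory-reuse strategies can be sketched with fixed-size blocks and a free list: each sequence's cache grows one block at a time, and finished sequences return their blocks for reuse, so no contiguous region is ever reserved. This is a loose sketch of the general paged-KV-cache technique, not the engine's actual memory manager; all names are hypothetical.

```python
class PagedKVCache:
    """Hypothetical sketch of a paged KV cache: GPU memory is carved
    into fixed-size blocks, each sequence maps to a list of block ids
    (its block table), and a free list enables immediate reuse."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free list for reuse
        self.tables = {}                     # seq_id -> [block ids]
        self.lengths = {}                    # seq_id -> tokens stored

    def append(self, seq_id):
        """Reserve cache space for one new token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # current block is full
            if not self.free:
                raise MemoryError("cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def used_blocks(self):
        return sum(len(t) for t in self.tables.values())
```

With `block_size=16`, a 17-token sequence occupies exactly two blocks; the diagnostic interfaces mentioned above would report figures like `used_blocks()` and the free-list length to guide tuning.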