Section 01
Introduction: The Core Solution for a Single-Machine, Multi-Model GPU Inference Server
This project provides a solution for running Qwen 3.5 (chat + vision), Whisper (speech transcription), and TimesFM 2.5 (time-series forecasting) together on a single Tesla P40 GPU. At its core is an "on-demand loading, idle unloading" mechanism that keeps GPU resource utilization efficient: models are loaded only when a request arrives and unloaded after a period of inactivity. When idle, GPU power draw falls to roughly 12 W, and all models are deployed in a single Docker container.
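The "on-demand loading, idle unloading" mechanism can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the class name `OnDemandModel`, the `loader` callable, and the `idle_timeout` parameter are all hypothetical, and the real server would additionally release GPU memory (e.g. via the framework's cache-freeing call) when a model is dropped.

```python
import threading
import time


class OnDemandModel:
    """Sketch of "on-demand loading, idle unloading" for one model.

    The model is loaded lazily on the first request and dropped after
    `idle_timeout` seconds without traffic, so an idle server holds no
    model weights in GPU memory.
    """

    def __init__(self, name, loader, idle_timeout=300.0):
        self.name = name
        self._loader = loader            # callable that loads and returns the model
        self._idle_timeout = idle_timeout
        self._model = None               # None means "currently unloaded"
        self._last_used = 0.0
        self._lock = threading.Lock()    # serialize load/unload/infer

    def infer(self, request):
        with self._lock:
            if self._model is None:      # lazy load on first use
                self._model = self._loader()
            self._last_used = time.monotonic()
            return self._model(request)

    def reap_if_idle(self):
        """Unload the model if it has been idle too long.

        Intended to be called periodically from a background thread.
        Returns True if the model was unloaded on this call.
        """
        with self._lock:
            if (self._model is not None
                    and time.monotonic() - self._last_used > self._idle_timeout):
                self._model = None       # drop the reference; weights can be freed
                return True
            return False


# Usage: a fake "model" stands in for a real checkpoint load.
qwen = OnDemandModel("qwen-demo", lambda: (lambda text: text.upper()),
                     idle_timeout=0.05)
print(qwen.infer("hello"))               # triggers lazy load, then runs inference
time.sleep(0.1)                          # let the idle timeout elapse
print(qwen.reap_if_idle())               # model is unloaded again
```

A background reaper thread calling `reap_if_idle()` on each registered model every few seconds is one simple way to get the low idle power draw described above.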