Section 01
ModeSwitch-LLM: A Guide to Dynamic Inference Optimization for Large Models on a Single GPU
ModeSwitch-LLM is a lightweight controller that switches inference modes at the request level. For each request, it selects a mode such as FP16, quantization, or speculative decoding based on that request's characteristics, achieving a 2.1x latency speedup and a 51.7% energy reduction on a single A100 GPU. Its core design combines multi-mode support with low-overhead feature extraction. The rule-based controller also outperforms learning-based routers, improving inference efficiency while preserving output quality.
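To make the idea of request-level mode switching concrete, here is a minimal sketch of a rule-based selector. It is not ModeSwitch-LLM's actual controller: the feature set (`prompt_tokens`, `max_new_tokens`, `latency_sensitive`), the two thresholds, and the rules themselves are all hypothetical, chosen only to show how cheap per-request features could map to one of the three modes named above.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FP16 = "fp16"
    QUANTIZED = "quantized"
    SPECULATIVE = "speculative"


@dataclass(frozen=True)
class RequestFeatures:
    # Hypothetical features; the source does not list the real ones.
    prompt_tokens: int
    max_new_tokens: int
    latency_sensitive: bool


# Hypothetical thresholds, for illustration only.
LONG_PROMPT_TOKENS = 1024
LONG_GENERATION_TOKENS = 256


def extract_features(prompt: str, max_new_tokens: int,
                     latency_sensitive: bool = False) -> RequestFeatures:
    # A whitespace split is a cheap stand-in for a tokenizer, which keeps
    # feature extraction overhead negligible in this sketch.
    return RequestFeatures(len(prompt.split()), max_new_tokens, latency_sensitive)


def select_mode(features: RequestFeatures) -> Mode:
    # Rule 1 (assumed): long generations go to speculative decoding.
    if features.max_new_tokens >= LONG_GENERATION_TOKENS:
        return Mode.SPECULATIVE
    # Rule 2 (assumed): long prompts without a latency constraint go to
    # the quantized mode.
    if features.prompt_tokens >= LONG_PROMPT_TOKENS and not features.latency_sensitive:
        return Mode.QUANTIZED
    # Default: plain FP16 inference.
    return Mode.FP16


if __name__ == "__main__":
    for prompt, budget in [("hello world", 16), ("summarize " * 2000, 16), ("hi", 512)]:
        print(budget, select_mode(extract_features(prompt, budget)).value)
```

Because every rule here is a constant-time comparison over features computed in one pass over the prompt, the selector adds almost no work per request, which is the property a lightweight controller depends on.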