Section 01
[Introduction] PALS: An Energy-Efficient LLM Inference System for MoE Models
PALS is an energy-efficient LLM inference system implemented in the vLLM framework. Its core idea is to treat GPU power caps as first-class control variables and to optimize them jointly with software parameters such as batch size, rather than tuning each in isolation. The system requires no model retraining and no API changes. Across multi-GPU systems and both dense and MoE models, it improves energy efficiency by up to 26.3% while reducing QoS violations by a factor of 4 to 7.
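To make the idea of joint optimization concrete, the following is a minimal sketch, not PALS's actual algorithm. It searches a small grid of (power cap, batch size) pairs and keeps the one with the lowest energy per token among those that meet a latency target. The names `pick_config` and `toy_measure`, the latency and power formulas, and all numeric values are assumptions made for illustration; a real system would replace `toy_measure` with profiled measurements.

```python
import math
from itertools import product


def toy_measure(power_cap_w, batch_size):
    """Hypothetical performance model (an assumption, not measured data).

    Returns (step_latency_s, avg_power_w) for one decoding step that
    produces one token per request in the batch.
    """
    base_latency_s = 0.010 + 0.0005 * batch_size
    # Assumed: throughput scales sublinearly with the power cap,
    # so latency grows as the cap is lowered below 300 W.
    latency_s = base_latency_s * math.sqrt(300.0 / power_cap_w)
    # Assumed: the GPU draws power at exactly its cap.
    return latency_s, float(power_cap_w)


def pick_config(power_caps_w, batch_sizes, measure, latency_slo_s):
    """Jointly search power caps and batch sizes.

    Returns ((power_cap_w, batch_size), energy_per_token_j) for the most
    energy-efficient pair that meets the latency SLO, or (None, inf)
    if no pair qualifies.
    """
    best, best_energy = None, float("inf")
    for cap, batch in product(power_caps_w, batch_sizes):
        latency_s, avg_power_w = measure(cap, batch)
        if latency_s > latency_slo_s:
            continue  # this pair would violate the QoS target
        energy_per_token_j = avg_power_w * latency_s / batch
        if energy_per_token_j < best_energy:
            best, best_energy = (cap, batch), energy_per_token_j
    return best, best_energy


best, energy = pick_config(
    power_caps_w=[200, 250, 300],
    batch_sizes=[8, 16, 32],
    measure=toy_measure,
    latency_slo_s=0.030,
)
print(best, round(energy, 4))
```

Under this toy model the search selects a 250 W cap with a batch size of 32. The lowest cap (200 W) would be more efficient per token but misses the latency target at that batch size, and the highest cap (300 W) meets the target but spends more energy per token. This is the kind of interaction that tuning the power cap and the batch size separately would miss. On NVIDIA GPUs, a chosen cap is typically applied with `nvidia-smi -pl <watts>`.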