Zing Forum

PALS: An Energy-Efficient LLM Inference System for MoE Models

PALS treats GPU power caps as first-class control variables and optimizes them jointly with software parameters such as batch size. It is implemented in the vLLM framework and requires no model retraining or API changes. It improves energy efficiency by up to 26.3% across multi-GPU systems and both dense and MoE models, while reducing QoS violations by a factor of 4 to 7.

LLM inference, energy efficiency optimization, GPU power management, MoE models, vLLM, data centers, green AI
Published 2026-05-21 01:19 | Recent activity 2026-05-21 10:47 | Estimated read 5 min

Section 01

Introduction: PALS, an Energy-Efficient LLM Inference System for MoE Models

PALS is an energy-efficient LLM inference system implemented in the vLLM framework. Its core idea is to treat GPU power caps as first-class control variables and to optimize them jointly with software parameters such as batch size, rather than treating power as a fixed limit. The system requires no model retraining or API changes. It improves energy efficiency by up to 26.3% across multi-GPU systems and both dense and MoE models, while reducing QoS violations by a factor of 4 to 7.

Section 02

Background: Energy Consumption Challenges in LLM Inference and Requirements for MoE Models

As LLMs are rapidly adopted across applications, inference serving has become the dominant workload in data centers, and the energy consumption of GPU clusters has become a pressing problem. Traditional inference optimization systems focus on throughput and latency and treat GPU power as a static constraint, so they cannot adapt when power budgets or load change. The rise of MoE architectures makes inference energy consumption patterns more complex, which increases the need for fine-grained power management.

Section 03

Methodology: Core Technical Mechanisms of PALS

PALS consists of an offline power-performance modeling module and an online feedback-driven controller. In the offline phase, it builds a power-performance model that captures the Pareto frontier of configurations, including the impact of MoE expert routing. In the online phase, the controller dynamically adjusts power caps and batch sizes. PALS integrates with vLLM as a plugin, so it remains compatible with the existing ecosystem.
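The two phases can be illustrated with a minimal Python sketch. It is not the PALS implementation: the `Config` fields, the profile numbers, the 0.8 headroom threshold, and the step-up/step-down rule are all assumptions made for illustration, and MoE expert routing effects are not modeled. The offline part filters profiled (power cap, batch size) configurations down to a Pareto frontier over throughput and power; the online part moves along that frontier in response to observed tail latency and the current power budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    power_cap_w: int        # GPU power cap (hypothetical profile entry)
    batch_size: int         # maximum batch size
    throughput_tps: float   # profiled tokens per second
    power_w: float          # profiled average power draw in watts
    p99_latency_ms: float   # profiled tail latency

    @property
    def tokens_per_joule(self) -> float:
        return self.throughput_tps / self.power_w


def pareto_frontier(configs):
    """Keep configs that no other config beats on both throughput and power."""
    frontier = []
    for c in configs:
        dominated = any(
            o.throughput_tps >= c.throughput_tps
            and o.power_w <= c.power_w
            and (o.throughput_tps > c.throughput_tps or o.power_w < c.power_w)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.power_w)


def choose_config(frontier, power_budget_w, latency_slo_ms):
    """Pick the most energy-efficient frontier point within budget and SLO."""
    feasible = [
        c for c in frontier
        if c.power_w <= power_budget_w and c.p99_latency_ms <= latency_slo_ms
    ]
    if not feasible:
        return frontier[0]  # fall back to the lowest-power point
    return max(feasible, key=lambda c: c.tokens_per_joule)


def feedback_step(frontier, current, observed_p99_ms, latency_slo_ms,
                  power_budget_w):
    """One online control step along the power-sorted frontier."""
    allowed = [c for c in frontier if c.power_w <= power_budget_w]
    if not allowed:
        return frontier[0]
    if current not in allowed:
        return allowed[-1]      # budget shrank: highest-power point that fits
    i = allowed.index(current)
    if observed_p99_ms > latency_slo_ms and i + 1 < len(allowed):
        return allowed[i + 1]   # SLO violated: spend more power
    if observed_p99_ms < 0.8 * latency_slo_ms and i > 0:
        return allowed[i - 1]   # ample headroom: reduce power draw
    return current


# Illustrative profile only; these numbers are made up, not measured.
PROFILE = [
    Config(300, 32, 2400.0, 290.0, 180.0),
    Config(300, 64, 2500.0, 295.0, 260.0),
    Config(400, 32, 2900.0, 380.0, 150.0),
    Config(400, 64, 3400.0, 390.0, 210.0),
    Config(500, 64, 3600.0, 480.0, 200.0),
    Config(500, 32, 3000.0, 470.0, 145.0),  # dominated by (400 W, batch 64)
]

frontier = pareto_frontier(PROFILE)
start = choose_config(frontier, power_budget_w=400, latency_slo_ms=250)
```

In this sketch a drop in the power budget is handled by the same `feedback_step` call as a latency violation, which mirrors the idea of treating the power cap as a control variable rather than a fixed constraint. A real controller would apply the chosen cap through the GPU's power-management interface and the batch size through the serving engine's scheduler settings.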

Section 04

Evidence: Key Findings from PALS Experimental Evaluation

In tests on H100/H800 multi-node systems, PALS improved energy efficiency by up to 26.3% over the baseline system. Under strict power constraints, it reduced QoS violations by a factor of 4 to 7. It also responds to changes in the power budget in real time, converging to a new target within seconds while maintaining service continuity.
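To make the two headline metrics concrete, here is a small Python sketch of how such figures are typically computed. It assumes energy efficiency is measured as tokens per joule, which the summary does not state, and the sample numbers are invented for illustration rather than taken from the PALS evaluation.

```python
def tokens_per_joule(tokens: int, energy_joules: float) -> float:
    """Energy efficiency as generated tokens per joule (assumed metric)."""
    if energy_joules <= 0:
        raise ValueError("energy_joules must be positive")
    return tokens / energy_joules


def relative_improvement(baseline: float, optimized: float) -> float:
    """Fractional gain of optimized over baseline, e.g. 0.25 means 25%."""
    return (optimized - baseline) / baseline


def reduction_factor(baseline_violations: int, optimized_violations: int) -> float:
    """How many times fewer QoS violations the optimized system has."""
    if optimized_violations <= 0:
        raise ValueError("optimized_violations must be positive")
    return baseline_violations / optimized_violations


# Invented sample numbers, for illustration only.
baseline_eff = tokens_per_joule(1_000_000, 125_000.0)   # 8.0 tokens/J
optimized_eff = tokens_per_joule(1_000_000, 100_000.0)  # 10.0 tokens/J
gain = relative_improvement(baseline_eff, optimized_eff)
factor = reduction_factor(60, 12)
```

With these sample inputs the gain is 25% and the violation count drops by a factor of 5. Read the same way, the reported result means up to 26.3% more work per unit of energy and one quarter to one seventh as many QoS violations as the baseline.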

Section 05

Conclusion: Implications of PALS for AI Infrastructure

PALS demonstrates that power control and inference performance are not a zero-sum game, and it provides a technical foundation for "grid-interactive AI". As models grow, energy efficiency will become a core design constraint, and power-aware control is expected to become a standard feature of next-generation LLM serving systems.

Section 06

Limitations and Future Directions

PALS currently targets NVIDIA GPUs and would need to be extended to support other hardware. Its response to extreme traffic bursts is also slower than desired and needs further optimization. Future work could combine it with predictive load modeling to allocate power ahead of demand, and explore joint optimization with techniques such as model quantization and sparsification.