Zing Forum


On-Device LLM Inference Test: Mobile Thermal Management is the Main Bottleneck, NPU Energy Efficiency Ratio Shines

Tests of Qwen 2.5 1.5B on a Raspberry Pi 5 with a Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop RTX 4050 show three things: the iPhone's throughput dropped by nearly half after two iterations, the S24 Ultra hit a system-enforced GPU frequency reduction, and the Hailo-10H NPU achieved energy efficiency comparable to the RTX 4050 while drawing less than 2 W.

Tags: on-device inference, mobile NPU, thermal management, energy efficiency, Qwen
Published 2026-03-25 02:28 | Recent activity 2026-03-27 13:22 | Estimated read 3 min


Section 02

Test Setup

Model: Qwen 2.5 1.5B (4-bit quantization)

Platforms:

  • Raspberry Pi 5 + Hailo-10H NPU
  • Samsung Galaxy S24 Ultra
  • iPhone 16 Pro
  • Laptop RTX 4050 GPU

Test Conditions: 258-token prompt, 20 consecutive iterations to observe thermal behavior
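
The post does not describe its benchmark harness, so the following is only a minimal sketch of how a 20-round back-to-back run could be timed. The `generate` callable is hypothetical: it stands in for whatever runtime performs one inference pass and is assumed to return the number of tokens it produced. The `clock` parameter is likewise an assumption, added so the loop can be checked without real hardware.

```python
import time
from typing import Callable, List


def run_thermal_iterations(
    generate: Callable[[str], int],
    prompt: str,
    rounds: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run `generate` back to back and return tokens/second for each round.

    `generate` is a hypothetical stand-in for one inference pass; it must
    return the number of tokens generated. No cooldown is inserted between
    rounds, so heat is allowed to accumulate across the run.
    """
    throughputs: List[float] = []
    for _ in range(rounds):
        start = clock()
        n_tokens = generate(prompt)
        elapsed = clock() - start
        # Guard against a zero reading on very coarse clocks.
        throughputs.append(n_tokens / elapsed if elapsed > 0 else float("inf"))
    return throughputs
```

Running the rounds without pauses is the point of the design: a single short run would report only peak throughput and hide any throttling.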


Section 03

Mobile Devices: Thermal Management is the Primary Constraint

  • iPhone 16 Pro: Throughput dropped by nearly 50% after two iterations
  • S24 Ultra: The system forced the GPU down to its frequency floor, and inference terminated completely
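
A throughput drop like the one reported for the iPhone can be flagged from per-iteration measurements. The sketch below is one possible way to do it, not the method used in the test. The 40% threshold and the sample values are made up purely for illustration and are not measurements from the post.

```python
from typing import List, Optional


def relative_drop(throughputs: List[float]) -> List[float]:
    """Fractional throughput loss of each iteration relative to the first."""
    baseline = throughputs[0]
    return [1.0 - t / baseline for t in throughputs]


def first_throttled_iteration(
    throughputs: List[float], threshold: float = 0.4
) -> Optional[int]:
    """Return the 1-based iteration whose drop first reaches `threshold`.

    Returns None if throughput never falls that far below the first round.
    """
    for i, drop in enumerate(relative_drop(throughputs), start=1):
        if drop >= threshold:
            return i
    return None


# Hypothetical tok/s values, invented for illustration only.
sample = [20.0, 19.0, 10.5, 10.2]
throttled_at = first_throttled_iteration(sample)
```

With these invented values the first two rounds stay near the baseline and the third falls 47.5% below it, so `throttled_at` is 3.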

Section 04

Dedicated Hardware: Different Constraints Dominate

Platform   | Throughput  | Power Consumption | Notes
RTX 4050   | 131.7 tok/s | 34.1 W            | Limited by battery power upper limit
Hailo-10H  | 6.9 tok/s   | <2 W              | Limited by module memory bandwidth, near-zero variance

Section 05

Energy Efficiency Ratio Surprise

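The Hailo-10H NPU stood out:

  • Energy efficiency comparable to the RTX 4050: at least about 3.45 tok/J (6.9 tok/s at under 2 W) versus about 3.86 tok/J (131.7 tok/s at 34.1 W)
  • Throughput is only about 1/19 of the RTX 4050's
  • Power consumption is less than 2 W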
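
The comparison can be reproduced from the throughput and power figures reported above. This is a small worked calculation, assuming that "energy efficiency ratio" means tokens per joule (tok/s divided by watts). Since the Hailo-10H power is only given as below 2 W, the code uses 2.0 W as an upper bound, which makes its result a lower bound on efficiency.

```python
def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    """Energy efficiency as tokens generated per joule consumed."""
    return tok_per_s / watts


# Figures reported in the post.
RTX_4050_TOK_S, RTX_4050_W = 131.7, 34.1
HAILO_TOK_S = 6.9
HAILO_W_UPPER = 2.0  # post says "<2 W"; 2.0 is used as an upper bound

rtx_eff = tokens_per_joule(RTX_4050_TOK_S, RTX_4050_W)
hailo_eff_min = tokens_per_joule(HAILO_TOK_S, HAILO_W_UPPER)
throughput_ratio = RTX_4050_TOK_S / HAILO_TOK_S

print(f"RTX 4050:  {rtx_eff:.2f} tok/J")                  # 3.86 tok/J
print(f"Hailo-10H: at least {hailo_eff_min:.2f} tok/J")   # 3.45 tok/J
print(f"Throughput ratio: {throughput_ratio:.1f}x")       # 19.1x
```

Because the actual Hailo-10H draw is below 2 W, its real efficiency is somewhat higher than the 3.45 tok/J bound, which is why the two platforms land in the same range despite the 19x throughput gap.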

Section 06

Deployment Insights

For always-on personal assistant scenarios:

  • Peak compute matters less than the ability to sustain throughput under thermal limits
  • NPUs have a significant advantage in energy efficiency
  • Platform-level optimization requires hardware and software to be considered together