SMEPilot: An LLM Inference Optimization Engine Based on the Scalable Matrix Extension

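SMEPilot uses a Roofline model to analyze the characteristics of SME-enabled CPUs and selects among CPU, SME, and collaborative execution modes to optimize at the operator level. On models such as Llama-3.2-3B and Qwen3-4B, it improves end-to-end inference performance by up to 3.94x across phone, PC, and server platforms.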

LLM inference, matrix extension, SME, CPU optimization, Roofline model, heterogeneous computing, Scalable Matrix Extension

Published 2026/06/15 15:35 | Last activity 2026/06/16 12:24 | Estimated reading time: 5 minutes

Section 01

SMEPilot: An Optimized LLM Inference Engine Using Scalable Matrix Extensions

SMEPilot is an LLM inference optimization engine that leverages Roofline model analysis of SME-enabled CPU features to intelligently select CPU/SME/collaborative execution modes for operator-level optimization. It achieves up to 3.94x end-to-end inference performance improvement across mobile, PC, and server platforms on models like Llama-3.2-3B and Qwen3-4B.

Section 02

Background: Rise of CPU Matrix Extensions & Challenges in LLM Inference

Modern CPUs (e.g., Arm processors with the Scalable Matrix Extension, SME) provide matrix extension instructions that boost matrix computation capability. However, LLM inference involves diverse operations (prefill, decode, attention, KV-cache handling) with varying compute and memory characteristics, and SME units compete with CPU cores for shared memory bandwidth. This makes it hard to fully exploit SME without fine-grained strategy selection.

Section 03

SMEPilot's Adaptive Execution Strategy Selection

SMEPilot offers three execution modes: CPU-only (for low compute density/bandwidth-limited operations), SME-only (for high compute density tasks like large matrix multiplications), and collaborative (SME+CPU parallel execution). It uses the Roofline model to analyze each operator's arithmetic intensity, vectorization level, and data layout requirements to choose the optimal mode.
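The Roofline reasoning behind this choice can be illustrated with a minimal Python sketch. It is not SMEPilot's actual selector: the peak-throughput and bandwidth figures, the `collab_efficiency` factor, and the function names are all hypothetical, and the sketch looks only at arithmetic intensity, ignoring the vectorization level and data layout factors mentioned above.

```python
def roofline(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix moves once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def choose_mode(intensity: float, cpu_peak: float, sme_peak: float,
                bandwidth_gbs: float, collab_efficiency: float = 0.9) -> str:
    """Pick the mode with the highest modeled throughput.

    collab_efficiency is a hypothetical factor for coordination overhead when
    CPU cores and the SME unit run together; it is not taken from the source.
    """
    candidates = [
        ("cpu", roofline(intensity, cpu_peak, bandwidth_gbs)),
        ("sme", roofline(intensity, sme_peak, bandwidth_gbs)),
        ("collaborative",
         roofline(intensity, collab_efficiency * (cpu_peak + sme_peak), bandwidth_gbs)),
    ]
    best_mode, best_perf = candidates[0]
    for mode, perf in candidates[1:]:
        if perf > best_perf:  # strict comparison: ties keep the simpler, earlier mode
            best_mode, best_perf = mode, perf
    return best_mode


# Illustrative platform numbers only (GFLOP/s and GB/s), not measurements.
decode = gemm_intensity(1, 4096, 4096)     # matrix-vector product, about 1 FLOP/byte
prefill = gemm_intensity(512, 4096, 4096)  # large matrix multiplication
print(choose_mode(decode, cpu_peak=200, sme_peak=2000, bandwidth_gbs=100))
print(choose_mode(prefill, cpu_peak=200, sme_peak=2000, bandwidth_gbs=100))
print(choose_mode(prefill, cpu_peak=600, sme_peak=2000, bandwidth_gbs=100))
```

With these made-up numbers, the decode-style product is bandwidth-bound on every unit and stays on the CPU, the large multiplication goes to SME, and collaboration wins only when the CPU's share of peak throughput outweighs the assumed coordination cost.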

Section 04

Key Technical Optimizations in SMEPilot

1. Tile-level task division: In collaborative mode, matrix workloads are split into tiles; SME handles the regular blocks and the CPU handles the irregular parts, which balances the load.
2. Attention pipeline overlap: SME processes the matrix stages (Query-Key and Attention-Value matmuls) while the CPU handles the vector stages (Softmax, masking) in parallel.
3. Layout state maintenance: The engine tracks tensor layouts and reuses packed tensors to reduce conversion overhead.
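As a rough illustration of the tile-level division, the sketch below partitions an output matrix into full square tiles, assigned to SME, and edge remainders, assigned to the CPU. The function name `split_tiles` and the tile size of 16 are assumptions made for this example; real SME tile dimensions depend on the hardware's streaming vector length and element type, and the source does not describe SMEPilot's actual partitioning rule.

```python
def split_tiles(rows: int, cols: int, tile: int = 16):
    """Partition a rows x cols output into tile x tile blocks.

    Returns (sme_tiles, cpu_tiles). Each entry is (row0, col0, height, width).
    Full tiles go to the SME list; partial edge tiles go to the CPU list.
    The tile size is a hypothetical parameter, not a value from the source.
    """
    sme_tiles, cpu_tiles = [], []
    for r0 in range(0, rows, tile):
        h = min(tile, rows - r0)
        for c0 in range(0, cols, tile):
            w = min(tile, cols - c0)
            target = sme_tiles if (h == tile and w == tile) else cpu_tiles
            target.append((r0, c0, h, w))
    return sme_tiles, cpu_tiles


def area(tiles):
    """Total number of output elements covered by a list of tiles."""
    return sum(h * w for _, _, h, w in tiles)


sme, cpu = split_tiles(40, 50)
print(len(sme), len(cpu))    # number of regular and irregular tiles
print(area(sme), area(cpu))  # elements assigned to each side
```

A real scheduler would also need to move some regular tiles to the CPU when SME becomes the bottleneck; this sketch only separates regular from irregular work.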

Section 05

Experimental Results: Cross-Platform Performance Improvements

Tested on mobile, PC, and server platforms with models such as Llama-3.2-3B, Qwen3-4B, and Qwen3-30B-A3B, SMEPilot achieves up to a 3.94x end-to-end performance gain over the baseline. The improvement is reported to be consistent across platforms and model scales, which supports the generality of the approach.

Section 06

Technical Contributions & Conclusion

Contributions:
1. Systematic analysis of SME-enabled CPU performance for LLM inference.
2. Adaptive execution strategy for heterogeneous resources.
3. Generalizable optimizations (tile division, pipeline overlap, layout maintenance).

Conclusion: SMEPilot's strategy selection and optimizations unlock SME's potential, providing an efficient way to deploy LLMs on CPU platforms with up to 3.94x speedup.

Section 07

Limitations & Future Directions

Limitations:
1. The current implementation is specific to Arm SME and needs adaptation for other matrix extensions such as Intel AMX.
2. Strategy selection is offline, not runtime adaptive.
3. Memory bandwidth bottlenecks limit SME's potential in some cases.

Future Directions: Adapt to other matrix extensions, explore runtime adaptive strategies, and combine with model compression to address bandwidth issues.