Zing Forum


Tempus: A Temporally Scalable GEMM Streaming Computing Framework for Edge AI

Tempus is a resource-invariant temporal GEMM framework for the AMD Versal AI Edge SoC. It achieves scalability with a fixed array of 16 AIE-ML cores and algorithm-level data partitioning, delivering 607 GOPS at only 10.677 W of power and realizing a 211.2x efficiency improvement over ARIES.

Tags: Edge AI · GEMM Acceleration · AMD Versal · AIE-ML · Temporal Scaling · LLM Inference · Matrix Multiplication · Low-Power Design
Published 2026-05-01 17:28 · Recent activity 2026-05-04 10:48 · Estimated read: 8 min

Section 01

[Introduction] Tempus: Core Analysis of a Temporally Scalable GEMM Streaming Computing Framework for Edge AI

Tempus is a resource-invariant temporal GEMM framework for the AMD Versal AI Edge SoC. It achieves temporal scalability with a fixed array of 16 AIE-ML cores and algorithm-level data partitioning. Delivering 607 GOPS while consuming only 10.677 W, it realizes a 211.2x efficiency improvement over ARIES and aims to solve the computation, memory, and power bottlenecks of edge AI deployment.


Section 02

Computing Power Dilemma of Edge AI and Limitations of Existing Solutions

The scaling law of large language models (LLMs) indicates that model quality improves with compute scale, but deployment on edge devices faces strict computation, memory, and power constraints. General matrix multiplication (GEMM) accounts for 90% of execution time in LLM inference, so accelerating GEMM is key to the practical application of edge AI. The AI Engine (AIE) array of the AMD Versal SoC provides a hardware foundation, but existing SOTA frameworks adopt spatial scaling strategies (distributing workloads across hundreds of cores) and, on resource-constrained edge SoCs, run into physical implementation failures, bandwidth saturation, and excessive resource consumption.


Section 03

Core Design of Tempus: From Spatial Scaling to Temporal Scaling

Tempus proposes shifting from spatial scaling to temporal scaling: it uses a fixed compute block of 16 AIE-ML cores and achieves scalability via iterative graph execution together with algorithmic data partitioning and replication in the Programmable Logic (PL). This design brings three major advantages:

  1. Resource Invariance: The number of cores remains constant when matrix size changes, avoiding edge resource contention.
  2. Efficient Data Flow: High-speed cascaded streams enable low-latency partial sum reduction with II=1.
  3. Deadlock-Free Protocol: DATAFLOW protocol maximizes transfer-computation overlap and PLIO reuse.
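The temporal-scaling idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the 16-worker batch loop stands in for the fixed AIE-ML block, `TILE` is an illustrative tile size, and `temporal_gemm` is a name invented here.

```python
# Hypothetical sketch of temporal scaling: a FIXED pool of 16 "cores"
# iterates over tile batches, so larger matrices take more iterations
# rather than more cores (resource invariance).
import numpy as np

NUM_CORES = 16  # fixed core count, per the article
TILE = 32       # illustrative tile edge length, not from the paper

def temporal_gemm(A, B):
    """Tiled GEMM where at most NUM_CORES output tiles are computed per
    iteration; runtime scales with the number of iterations."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0
    C = np.zeros((M, N), dtype=A.dtype)
    tiles = [(i, j) for i in range(0, M, TILE) for j in range(0, N, TILE)]
    # Process tiles in fixed-size batches: resource use stays constant,
    # only the number of batches grows with the matrix (temporal scaling).
    for start in range(0, len(tiles), NUM_CORES):
        for (i, j) in tiles[start:start + NUM_CORES]:
            C[i:i+TILE, j:j+TILE] = A[i:i+TILE, :] @ B[:, j:j+TILE]
    return C
```

A 1024x1024 GEMM would need 1024 output tiles, i.e. 64 batches on the same 16 workers, whereas a spatial design would instead try to recruit more cores.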

Section 04

Technical Implementation Details of Tempus

Algorithmic Data Partitioning

Tempus implements intelligent data partitioning at the PL layer: input matrices are divided into tiles suitable for AIE-ML local memory, with efficient transfer between DDR and AIE arrays via DMA engines. The optimized layout considers computation dependencies and memory access patterns.
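The partitioning step can be pictured as a tile-plan generator. The sketch below is illustrative: the 64 KiB local-memory budget, int8 element size, and the `tile_plan` name are assumptions, not figures from the paper; it only shows the shape of the descriptors a PL-side DMA engine would stream.

```python
# Illustrative PL-side tile partitioning: split a rows x cols matrix into
# tiles that fit an assumed per-core local-memory budget, and emit the
# (offset, shape) descriptors a DMA engine would transfer.

LOCAL_MEM_BYTES = 64 * 1024   # assumed local-memory budget (not from paper)
ELEM_BYTES = 1                # e.g. int8 operands (assumption)

def tile_plan(rows, cols, tile_rows, tile_cols):
    """Yield (row_off, col_off, tr, tc) tiles covering the matrix,
    clipping edge tiles so ragged borders are still handled."""
    assert tile_rows * tile_cols * ELEM_BYTES <= LOCAL_MEM_BYTES, \
        "tile must fit in local memory"
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            yield (r, c, min(tile_rows, rows - r), min(tile_cols, cols - c))
```

For a 100x100 matrix with 64x64 tiles this yields four descriptors, the last being the clipped 36x36 corner tile; together they cover every element exactly once.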

Cascaded Stream Architecture

Using the cascaded stream capability of AIE-ML cores, Tempus achieves pipelined reduction of partial results: each core processes its assigned tiles and passes intermediate results to the next core for accumulation; with an initiation interval (II) of 1, one result is output per clock cycle.
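Functionally, the cascade amounts to splitting the K (reduction) dimension across a chain of cores, each adding its partial product to the running sum it receives. A minimal sketch, with the stage split and the `cascaded_gemm` name assumed for illustration:

```python
# Sketch of cascaded partial-sum reduction: the K dimension is split
# across a chain of stages; each stage (one core in the cascade) adds
# its partial product and forwards the running sum to the next stage.
import numpy as np

def cascaded_gemm(A, B, num_cores=16):
    K = A.shape[1]
    # Stage boundaries partitioning the K dimension across the chain.
    bounds = np.linspace(0, K, num_cores + 1, dtype=int)
    psum = np.zeros((A.shape[0], B.shape[1]))
    for s in range(num_cores):                     # core s in the cascade
        k0, k1 = bounds[s], bounds[s + 1]
        psum = psum + A[:, k0:k1] @ B[k0:k1, :]    # accumulate, pass on
    return psum
```

In hardware the stages run concurrently as a pipeline, so once it fills, a completed sum leaves the last core every cycle (the II=1 behavior); the serial loop here only models the dataflow.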

Transfer-Computation Overlap

Maximizes transfer-computation overlap via double buffering mechanism and DATAFLOW protocol: while AIE cores process current blocks, DMA prepares the next blocks, hiding memory access latency.
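The ping-pong schedule behind double buffering can be sketched as follows. This is a scheduling-order illustration only (serial Python, invented `fetch`/`compute` callbacks), not the DATAFLOW implementation: while one buffer is being consumed, the other is being refilled.

```python
# Double-buffering (ping-pong) sketch: while compute consumes the
# current buffer, the "DMA" prefetches the next block into the other
# buffer, so in hardware transfer and computation would overlap.

def process_blocks(blocks, fetch, compute):
    """fetch(block) -> data; compute(data) -> result."""
    results = []
    bufs = [None, None]   # the two ping-pong buffers
    ping = 0
    if blocks:
        bufs[ping] = fetch(blocks[0])            # prime the first buffer
    for idx in range(len(blocks)):
        pong = 1 - ping
        if idx + 1 < len(blocks):
            bufs[pong] = fetch(blocks[idx + 1])  # prefetch next block
        results.append(compute(bufs[ping]))      # consume current block
        ping = pong                              # swap buffer roles
    return results
```

Because the prefetch for block i+1 is issued before block i is computed, the memory latency of every block except the first is hidden behind useful work.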


Section 05

Performance Evaluation and Comparison Data with ARIES

Benchmark Results

Tempus achieves the following on GEMM workloads:

  • 607 GOPS computing performance
  • 10.677 W total on-chip power consumption
  • 0.00% URAM/DSP utilization (relying entirely on AIE-ML cores)
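As a quick sanity check, the power efficiency implied by the two headline figures can be recomputed directly:

```python
# Efficiency implied by the reported numbers: 607 GOPS at 10.677 W.
perf_gops = 607.0
power_w = 10.677
gops_per_watt = perf_gops / power_w
print(f"{gops_per_watt:.1f} GOPS/W")  # prints "56.9 GOPS/W"
```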

Comparison with ARIES

Measured by the Platform-Aware Utility (PAU) metric, Tempus achieves the following relative to the spatial SOTA solution ARIES:

  • 211.2x improvement in significance factor
  • 22.0x core frugality
  • 7.1x power frugality
  • 6.3x reduction in I/O demand

The difference stems from design philosophy: ARIES stacks hardware to pursue peak performance, while Tempus achieves sustainable scaling via algorithmic optimization and temporal scheduling.


Section 06

Significance of Tempus for Edge LLM Inference and Design Principles

Tempus establishes a sustainable and scalable foundation for edge LLM inference. In resource-constrained edge environments, the idea that "more cores = better performance" no longer applies; fine-grained algorithm design and hardware co-optimization can approach theoretical efficiency under fixed resources.

Enlightenments for edge AI design principles:

  1. Algorithm-Hardware Co-Design: Fully utilize target hardware capabilities (e.g., AIE-ML cascaded streams) instead of porting general algorithms.
  2. Prioritize Temporal Optimization: In resource-constrained scenarios, temporal scheduling is more cost-effective than spatial parallelism.
  3. Scalability ≠ Resource Scaling: True scalability should be at the algorithmic level, not hardware stacking.

Section 07

Limitations of Tempus and Future Directions

Tempus currently focuses on GEMM operators; future explorations can include:

  • Extending the temporal scaling strategy to other compute-intensive operators like convolution and attention mechanisms
  • Combining sparsity utilization to further reduce computation and memory requirements
  • Exploring dynamic resource scheduling in multi-task scenarios

Section 08

Conclusion: Tempus Provides a New Paradigm for Edge AI Acceleration

Tempus provides an efficient and sustainable solution for GEMM acceleration in edge AI via a resource-invariant temporal scaling strategy. Balancing 607 GOPS of performance against 10.677 W of power, it demonstrates that edge LLM inference can be achieved through algorithmic innovation and hardware co-optimization rather than expensive hardware stacking, offering an important reference paradigm for the design of next-generation edge AI accelerators.