Zing Forum


GELATO: An Adaptive Token Offloading Framework for Edge-Cloud Collaborative Speculative Decoding Based on Generative Entropy and Lyapunov

The GELATO framework maximizes decoding throughput under energy constraints in resource-constrained edge-cloud collaborative speculative decoding systems through a drift-plus-penalty outer loop and a nested entropy-driven generation mechanism, achieving a 64.98% increase in throughput and a 47.47% reduction in energy consumption.

Edge-cloud collaborative inference · Speculative decoding · Lyapunov optimization · Generative entropy · Edge-side AI · Resource scheduling · Energy efficiency
Published 2026-05-11 15:38 · Recent activity 2026-05-12 10:51 · Estimated read 9 min

Section 01

Introduction to the GELATO Framework: An Adaptive Token Offloading Scheme for Edge-Cloud Collaborative Speculative Decoding

GELATO (An Adaptive Token Offloading Framework for Edge-Cloud Collaborative Speculative Decoding Based on Generative Entropy and Lyapunov) maximizes decoding throughput under energy constraints in resource-constrained edge-cloud collaborative speculative decoding systems through a drift-plus-penalty outer loop and a nested entropy-driven generation mechanism. Experimental results show that the framework increases throughput by 64.98% and reduces energy consumption by 47.47%, offering a new approach to inference optimization for edge-side Large Language Models (LLMs).


Section 02

Challenges of Edge-Side AI Inference and Current State of Speculative Decoding

Rise and Challenges of Edge-Side AI Inference

As LLM capabilities improve, demand for edge-side deployment has become urgent. However, edge devices have limited computing resources and battery capacity, making it difficult to run large models locally. Edge-cloud collaborative inference architectures have emerged in response, intelligently distributing tasks between terminals and edge servers; speculative decoding is one of the most promising technical routes among them.

Working Principle of Speculative Decoding

A lightweight draft model on the device quickly generates candidate token sequences, which are submitted to the target model on the edge server for batch verification. This reduces latency, conserves bandwidth, and maintains output quality. However, existing static strategies (fixed draft models, verification thresholds, etc.) cannot adapt to the dynamic uncertainty of generation, leading to low resource utilization efficiency.
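The draft-then-verify loop described above can be sketched as follows. The `draft_model` and `target_model` below are hypothetical toy stand-ins (simple deterministic functions, not real LLMs), used only to show the control flow: the draft proposes a batch of tokens, and the target accepts the longest agreeing prefix, substituting its own token at the first mismatch.

```python
# Sketch of the speculative decoding round GELATO builds on.
# Toy models: these are illustrative assumptions, not the paper's models.

def draft_model(prefix, k):
    """Cheaply propose k candidate tokens (toy: consecutive integers)."""
    last = prefix[-1] if prefix else 0
    return [last + i + 1 for i in range(k)]

def target_model(prefix):
    """Expensive model's 'true' next token (toy: +2 after even, else +1)."""
    last = prefix[-1] if prefix else 0
    return last + (2 if last % 2 == 0 else 1)

def speculative_step(prefix, k=4):
    """One draft-then-verify round; returns the tokens actually accepted."""
    candidates = draft_model(prefix, k)
    accepted = []
    ctx = list(prefix)
    for tok in candidates:
        truth = target_model(ctx)
        if tok == truth:          # draft agrees with target: token is "free"
            accepted.append(tok)
            ctx.append(tok)
        else:                     # first mismatch: take target's token, stop
            accepted.append(truth)
            break
    return accepted
```

When the draft agrees with the target for several tokens in a row, one expensive verification pass yields multiple output tokens, which is where the latency savings come from.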


Section 03

Core of the GELATO Framework: Two-Level Adaptive Mechanism

The GELATO framework addresses the challenges of edge-cloud environments through a two-level adaptive mechanism:

Outer Loop: Drift-Penalty Decision

Built on the Lyapunov optimization framework, the outer loop maintains an energy deficit queue that tracks cumulative deviation from the energy budget. Each decision cycle, it adjusts resource allocation by minimizing a drift term (penalizing growth of the energy deficit) plus a penalty term (trading off throughput gains), achieving online optimization under a long-term energy constraint.
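A minimal sketch of this outer loop, assuming a discrete set of candidate actions with known expected throughput and energy per slot (the action names, the candidate set, and the tradeoff parameter `V` are illustrative assumptions, not the paper's API):

```python
# Drift-plus-penalty control sketch. A virtual "energy deficit" queue Q
# accumulates overspend against the per-slot budget; each slot the
# controller picks the action minimizing  -V * throughput(a) + Q * energy(a),
# so a large backlog Q steers decisions toward energy saving.

def update_deficit_queue(q, energy_used, energy_budget):
    """Q(t+1) = max(Q(t) + E(t) - E_budget, 0)."""
    return max(q + energy_used - energy_budget, 0.0)

def choose_action(q, actions, V=10.0):
    """Pick the action minimizing the drift-plus-penalty expression.

    `actions` maps an action name to (expected_throughput, expected_energy).
    """
    return min(actions,
               key=lambda a: -V * actions[a][0] + q * actions[a][1])
```

With an empty queue the controller favors the high-throughput action; as the deficit grows, the `q * energy` term dominates and it switches to the frugal one, which is exactly the self-correcting behavior the drift term provides.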

Inner Mechanism: Entropy-Driven Generation

Generative entropy quantifies per-token uncertainty: when entropy is low, the draft model exits early and submits its tokens for verification; when entropy is high, it increases computational depth and dynamically adjusts the sampling strategy, achieving fine-grained resource allocation.
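The entropy gate itself is inexpensive, since the draft model's logits are already available. A minimal sketch (the threshold value is an illustrative assumption, not the paper's):

```python
import math

# Entropy-driven gate: compute the Shannon entropy of the draft model's
# next-token softmax distribution; low entropy means the draft is confident,
# so it exits early and submits for verification, while high entropy
# triggers deeper computation.

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_gate(logits, low=0.5):
    """Return 'early_exit' when the draft is confident, else 'deepen'."""
    h = entropy(softmax(logits))
    return "early_exit" if h < low else "deepen"
```

A sharply peaked distribution (one dominant logit) has entropy near zero and exits early; a near-uniform distribution over n tokens has entropy near ln(n) and triggers deeper computation.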


Section 04

Theoretical Performance Guarantees of GELATO

The GELATO framework has a solid theoretical foundation:

  1. Long-term Throughput Optimality: Under the premise of satisfying energy constraints, the throughput converges to the theoretical optimal value;
  2. Energy Constraint Satisfaction: The long-term average energy consumption does not exceed the preset budget;
  3. Queue Stability: The energy deficit queue remains bounded, ensuring stable system operation.
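These three guarantees follow the standard drift-plus-penalty pattern of Lyapunov optimization. In generic form, with tradeoff parameter \(V\), a constant \(B\) bounding the per-slot drift, and slack \(\epsilon\) (these symbols come from the standard analysis; the paper's exact constants are not reproduced here):

```latex
% Standard drift-plus-penalty tradeoff (generic form):
% throughput loss shrinks as O(1/V) while queue backlog grows as O(V).
\bar{T} \;\ge\; T^{*} - \frac{B}{V}
\qquad\qquad
\bar{Q} \;\le\; \frac{B + V\,(T^{*} - T_{\min})}{\epsilon}
```

Here \(T^{*}\) is the optimal long-term throughput achievable under the energy constraint, so increasing \(V\) trades a larger (but still bounded) energy deficit backlog for throughput arbitrarily close to optimal.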

Section 05

Experimental Evidence: Performance Improvement and Adaptability Verification of GELATO

In evaluations on real hardware platforms, GELATO delivers significant gains:

  • 64.98% Throughput Increase: Compared to advanced distributed speculative decoding architectures, resource allocation is more intelligent;
  • 47.47% Energy Consumption Reduction: Energy consumption is halved at the same throughput, extending the battery life of edge-side devices;
  • Decoding Quality Preservation: The target model verification mechanism ensures output quality is consistent with the baseline system;
  • Strong Adaptability: Adapts to different workloads (short text generation/long document continuation) and energy constraints.

Section 06

Technical Details and Implications for Edge-Side AI Deployment

Technical Implementation Details

  • Real-time Entropy Calculation: Obtains probability distribution through the softmax layer with low computational overhead;
  • Lyapunov Queue Maintenance: Updated in decision cycles with negligible control overhead;
  • Integration with Speculative Decoding: Compatible with existing implementations, adjusting draft budget and computational depth.
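One way the two control signals could be combined each decision cycle is a simple rule mapping the deficit-queue backlog and the draft model's generative entropy to a per-round draft budget. All names and scaling constants below are illustrative assumptions, not the paper's method:

```python
# Hypothetical sketch: a large energy deficit shrinks the draft budget,
# while low generative entropy (a confident draft) allows a longer
# speculation run. q_scale and h_scale are illustrative constants.

def draft_budget(q_deficit, gen_entropy, k_max=8, k_min=1,
                 q_scale=100.0, h_scale=2.0):
    """Shrink the draft length as deficit or uncertainty grows."""
    energy_factor = 1.0 / (1.0 + q_deficit / q_scale)          # in (0, 1]
    confidence_factor = max(0.0, 1.0 - gen_entropy / h_scale)  # in [0, 1]
    k = round(k_max * energy_factor * confidence_factor)
    return max(k_min, min(k_max, k))
```

With no deficit and a confident draft the budget stays at its maximum; either a growing deficit or rising uncertainty pulls it down, realizing the "adjusting draft budget and computational depth" integration point above.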

Implications for Edge-Side AI Deployment

  • Adaptive strategies are superior to static configurations;
  • Optimization theory and information theory guide system design;
  • Edge-cloud collaboration requires intelligent task allocation to leverage the advantages of both sides.

Section 07

Limitations of GELATO and Future Research Directions

Limitations

  • Assumes stable network connections and does not fully consider network fluctuations;
  • The choice of entropy threshold affects performance, and adaptive thresholds need to be explored;
  • Fairness and resource allocation in multi-user scenarios are not addressed.

Future Directions

  • Combine reinforcement learning to optimize decision strategies;
  • Extend to edge-cloud collaborative inference for multimodal models;
  • Study privacy-preserving collaborative inference in federated learning scenarios.

Section 08

Significance and Future Outlook of GELATO

GELATO represents an important advance in edge-side LLM inference optimization. Its two-level adaptive mechanism delivers significant performance improvements under energy constraints and is backed by formal theoretical guarantees. As LLMs spread to mobile devices and edge scenarios, such resource optimization techniques will drive the adoption of AI across a wider range of devices and use cases, improving user experience.