Section 01
Introduction to the GELATO Framework: An Adaptive Token Offloading Scheme for Edge-Cloud Collaborative Speculative Decoding
GELATO (An Adaptive Token Offloading Framework for Edge-Cloud Collaborative Speculative Decoding Based on Generative Entropy and Lyapunov) targets resource-constrained edge-cloud collaborative speculative decoding systems, where it maximizes decoding throughput subject to energy constraints. It does so by combining a Lyapunov drift-penalty cycle with a nested entropy-driven generation mechanism. Experimental results show that the framework increases throughput by 64.98% and reduces energy consumption by 47.47%, offering a new approach to inference optimization for edge-side Large Language Models (LLMs).
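
To make the two ingredients named above more concrete, the following is a minimal Python sketch of a generic drift-plus-penalty decision step paired with a token-entropy measure. It is an illustration under stated assumptions and not GELATO's actual algorithm: the function names (`token_entropy`, `choose_draft_length`, `update_energy_queue`), the candidate tuple layout, the trade-off weight `v`, and the per-step energy budget are all hypothetical and do not come from the source.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_draft_length(queue, v, candidates):
    """Pick the candidate with the lowest drift-plus-penalty score.

    candidates: list of (draft_length, expected_tokens, energy_used) tuples.
    All three fields are hypothetical placeholders for this sketch.
    Score = queue * energy_used - v * expected_tokens, so a large energy
    backlog favors cheap actions and a large v favors throughput.
    """
    return min(candidates, key=lambda c: queue * c[2] - v * c[1])


def update_energy_queue(queue, energy_used, energy_budget):
    """Virtual energy queue: grows when a step exceeds its budget, floors at 0."""
    return max(queue + energy_used - energy_budget, 0.0)


# Hypothetical candidates: (draft_length, expected_tokens, energy_used).
candidates = [(1, 1.0, 1.0), (4, 3.0, 5.0)]

# With an empty queue the throughput term dominates, so the longer draft wins.
relaxed_choice = choose_draft_length(queue=0.0, v=1.0, candidates=candidates)

# With a large energy backlog the cheaper, shorter draft wins.
constrained_choice = choose_draft_length(queue=10.0, v=1.0, candidates=candidates)
```

In this sketch the virtual queue plays the role of the energy constraint, while entropy is only computed, not acted on. One plausible use, again an assumption, is to stop drafting on the edge device and hand tokens to the cloud once `token_entropy` exceeds a threshold, since high entropy suggests the draft model is uncertain.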