Section 01
[Introduction] CATS: A New Framework for Accelerating LLM Inference in Memory-Constrained Scenarios
CATS (Cascaded Adaptive Tree Speculation) is an adaptive, cascaded tree-based speculative decoding framework designed for memory-constrained scenarios. Through a cascaded adapter architecture, it substantially reduces the number of forward passes through the large target model while preserving the target model's output distribution, offering a new approach to accelerating LLM inference on memory-constrained devices.
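For context, the correctness guarantee above rests on the standard speculative-sampling acceptance rule (accept a draft token with probability min(1, p/q), otherwise resample from the renormalized residual), which keeps the target distribution exact regardless of how drafts are proposed. The sketch below illustrates that generic verification step only; it is not CATS's cascaded-adapter or tree-speculation machinery, and all names in it are illustrative.

```python
import numpy as np

def speculative_accept(p, q, draft_token, rng):
    """Generic speculative-sampling acceptance step (illustrative, not CATS-specific).
    p, q: target / draft probability vectors over the vocabulary.
    Accept the draft with probability min(1, p/q); on rejection, resample
    from the residual max(p - q, 0), renormalized. The resulting token is
    distributed exactly according to p."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True   # accepted: draft token kept as-is
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()     # renormalize the residual distribution
    return rng.choice(len(p), p=residual), False

# Tiny demo over a 4-token vocabulary: the empirical output
# distribution matches the target p, not the draft q.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.2, 0.1])      # target model distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # draft model distribution
counts = np.zeros(4)
for _ in range(20000):
    draft = rng.choice(4, p=q)
    tok, _ = speculative_accept(p, q, draft, rng)
    counts[tok] += 1
print(counts / counts.sum())
```

Frameworks like CATS differ in how draft tokens are proposed (here, via cascaded adapters and tree-structured speculation), while a verification rule of this kind is what lets them skip target-model forward passes without changing the output distribution.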