Section 01
[Introduction] Core Insights from the Study on the Illusion of Numerical Equivalence in the FP16 KV Cache
This study challenges the default assumption that KV caching and cache-free recomputation are numerically equivalent in autoregressive Transformer inference. It finds that under FP16 precision the two paths produce systematic, deterministic divergences in the generated token sequences, and that the cache-on path achieves higher accuracy under most test conditions. The root cause is the non-associativity of floating-point arithmetic: the cached and recomputed paths evaluate the same attention sums in different orders, so rounding errors accumulate differently. This finding has important theoretical and practical implications for model deployment and evaluation.
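The non-associativity at the heart of this finding is easy to reproduce in isolation. The following minimal sketch (my own illustration, not code from the study) constructs FP16 values whose sum depends on the order of addition, which is exactly the kind of reordering that distinguishes an incremental cached accumulation from a fused recomputation kernel:

```python
import numpy as np

# FP16 has a 10-bit mantissa, so the spacing (ulp) between
# representable values near 1.0 is 2**-10. An addend of 2**-11
# (half an ulp) is lost when added to 1.0 alone, but two such
# addends survive if they are combined with each other first.
a = np.float16(1.0)
b = np.float16(2**-11)
c = np.float16(2**-11)

left = np.float16(np.float16(a + b) + c)   # (a + b) + c -> 1.0
right = np.float16(a + np.float16(b + c))  # a + (b + c) -> 1.0 + 2**-10

print(left, right)          # 1.0 1.001
print(left == right)        # False: same operands, different order
```

In a real model the per-step error is similarly tiny, but autoregressive decoding feeds each output back as input, so a single flipped logit comparison can deterministically change every subsequent token.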