Zing Forum


The Numerical Equivalence Illusion of FP16 KV Cache: A Study on Systematic Biases in Autoregressive Inference

This article reveals the numerical non-equivalence between KV-cache inference and cache-free recomputation under FP16 precision. The study finds that, because floating-point operations are non-associative, the two execution paths produce deterministic token-sequence divergences, and the cache-enabled path is more accurate in most of the tested conditions.

KV cache, FP16, numerical equivalence, floating-point arithmetic, non-associativity, transformer inference, autoregressive generation
Published 2026-04-16 23:59 · Recent activity 2026-04-20 10:52 · Estimated read 5 min

Section 01

[Introduction] Core Insights of the Study on the Numerical Equivalence Illusion of FP16 KV Cache

This study challenges the default assumption that KV-cache inference and cache-free recomputation are numerically equivalent in autoregressive Transformer inference. It finds that under FP16 precision the two paths exhibit systematic, deterministic token-sequence divergences, and that the cache-ON path achieves higher accuracy under most test conditions. The root cause is the non-associativity of floating-point operations, a finding with important theoretical and practical implications for model deployment and evaluation.


Section 02

Background: The Role of KV Cache and the Overlooked Assumption

KV cache is a key optimization for Transformer inference: it speeds up long-sequence generation by reusing the key/value vectors computed for previous tokens instead of recomputing them. The field has long assumed that KV-cache inference and cache-free recomputation are numerically equivalent; this study shows empirically that under FP16 the two diverge systematically, breaking that assumption.
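The two execution paths can be sketched in a few lines of NumPy (purely illustrative; the weight names and dimensions are made up, not the paper's code). The cache-ON path appends each new token's key/value vectors to a growing cache, while the cache-free path recomputes K and V for the whole prefix with full matrix multiplications. In FP64 the two agree to rounding; the paper's point is that FP16 breaks this agreement.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
X = rng.standard_normal((T, d))                     # hidden states for T tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # softmax(q K^T / sqrt(d)) V for a single query vector
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum()
    return w @ V

# Cache-ON path: append each token's k/v to a persistent cache
cache_K, cache_V = [], []
for t in range(T):
    cache_K.append(X[t] @ Wk)
    cache_V.append(X[t] @ Wv)
out_cached = attend(X[-1] @ Wq, np.stack(cache_K), np.stack(cache_V))

# Cache-free path: recompute K and V for the whole prefix at once
out_recomputed = attend(X[-1] @ Wq, X @ Wk, X @ Wv)

print(np.allclose(out_cached, out_recomputed))  # True at FP64
```

Both paths compute the same attention in exact arithmetic; the divergence the paper documents appears only once the intermediate results are rounded to FP16.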


Section 03

Methodology: The Mathematical Root of Non-Associativity in Floating-Point Operations

Because precision is finite, floating-point operations do not satisfy the associative law: FP16 (16-bit) rounding errors accumulate differently depending on the order of operations. The KV-cache and cache-free paths order their operations differently (the former builds the attention inputs by concatenating cached vectors, while the latter recomputes them with full matrix multiplications), leading to numerical divergences under FP16.
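The non-associativity is easy to see in isolation. A minimal demonstration (generic NumPy, not from the paper): in FP16, adding a small value before versus after a large cancelling pair changes the result.

```python
import numpy as np

a = np.float16(0.1)
b = np.float16(1e4)    # near FP16's max (65504); the spacing here is 8.0
c = np.float16(-1e4)

left = (a + b) + c     # a is lost when rounded into b, then b and c cancel
right = a + (b + c)    # b and c cancel exactly first, so a survives

print(left, right)     # 0.0 vs 0.1: same operands, different order
```

The same mechanism, repeated across millions of FP16 accumulations with two different summation orders, is what drives the two inference paths apart.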


Section 04

Experimental Evidence: Divergence Rate and Accuracy Differences

Experimental setup: models include LLaMA-2-7B, Mistral-7B-v0.3 (GQA), and Gemma-2-2B; the benchmark is GSM8K; sampling strategies cover both greedy decoding and random sampling. Key findings:
  1. The divergence rate is 100% under all conditions, even with greedy decoding;
  2. The cache-ON path has higher accuracy in 8 of 9 conditions;
  3. Under FP32 the divergence rate drops sharply, with a token flip rate of 0, confirming FP16 non-associativity as the main cause.
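The reported divergence comes from the two paths summing the same products in different orders. A self-contained toy (not the paper's code; the two summation orders here merely stand in for the cache-ON and cache-free paths) counts how often a left-to-right FP16 dot product differs from a pairwise-tree FP16 dot product over random vectors, a crude analogue of the per-token flip rate:

```python
import numpy as np

def dot_fp16_seq(x, y):
    # Left-to-right FP16 accumulation (stand-in for one execution path)
    acc = np.float16(0.0)
    for xi, yi in zip(x, y):
        acc = np.float16(acc + xi * yi)
    return acc

def dot_fp16_tree(x, y):
    # Pairwise-tree FP16 accumulation (stand-in for the other path)
    p = (x * y).astype(np.float16)
    while p.size > 1:
        if p.size % 2:
            p = np.append(p, np.float16(0.0))  # pad odd length
        p = (p[0::2] + p[1::2]).astype(np.float16)
    return p[0]

rng = np.random.default_rng(0)
trials, d = 200, 128
mismatches = 0
for _ in range(trials):
    x = rng.standard_normal(d).astype(np.float16)
    y = rng.standard_normal(d).astype(np.float16)
    if dot_fp16_seq(x, y) != dot_fp16_tree(x, y):
        mismatches += 1

print(f"{mismatches}/{trials} dot products differ between summation orders")
```

Even at a head dimension of 128, a substantial fraction of trials disagree; repeating the same experiment in FP32 or FP64 makes the mismatches far rarer, mirroring the paper's FP32 control.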


Section 05

In-Depth Analysis: Divergence Patterns Across Different Architectures

  • Mistral-7B (GQA architecture): Divergence amplifies sharply in the first layer, as multiple query heads sharing key heads magnify FP16 errors;
  • Gemma-2-2B: Divergence accumulates uniformly across layers, which is related to larger attention head dimensions and sliding window mechanisms.

Section 06

Activation Patching Experiment: Locating Causal Variables

Activation patching of the entire residual stream fails to restore the cache-free generation trajectory, indicating that the causal variable behind the divergence is the stateful KV cache itself, rather than transient errors in the attention computation.
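The distinction between transient computation and persistent state can be illustrated with a toy (purely illustrative; this is not the paper's patching setup): two runs share weights, but one wrote slightly different K/V into its cache at step 1. Patching step 2's hidden state to be identical across runs does not reconcile the outputs, because the divergent values already live in the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wk = rng.standard_normal((d, d)).astype(np.float16)
Wv = rng.standard_normal((d, d)).astype(np.float16)

def step(cache_k, cache_v, h):
    # Commit this token's k/v to the persistent cache, then attend over it
    cache_k.append((h @ Wk).astype(np.float16))
    cache_v.append((h @ Wv).astype(np.float16))
    K, V = np.stack(cache_k), np.stack(cache_v)
    w = np.exp((K @ h).astype(np.float32) / np.sqrt(d))
    w /= w.sum()
    return (w @ V.astype(np.float32)).astype(np.float16)

h1 = rng.standard_normal(d).astype(np.float16)
h1_alt = (h1.astype(np.float32) + 1e-2).astype(np.float16)  # perturbed step-1 state
h2 = rng.standard_normal(d).astype(np.float16)

ck_a, cv_a = [], []
ck_b, cv_b = [], []
step(ck_a, cv_a, h1)      # run A: clean history
step(ck_b, cv_b, h1_alt)  # run B: slightly different history in the cache

# "Patch" step 2: both runs get the identical hidden state h2 ...
out_a = step(ck_a, cv_a, h2)
out_b = step(ck_b, cv_b, h2)

# ... yet outputs still differ, because the caches already diverged
print(np.array_equal(out_a, out_b))
```

Overwriting the transient state (`h2`) is not enough; only rewriting the cache entries themselves would reconcile the two runs, which matches the paper's conclusion that the cache is the causal variable.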


Section 07

Implications for LLM Inference Systems

  1. Re-examine the numerical equivalence assumption: an FP16 KV cache is a lossy optimization;
  2. Precision-efficiency trade-off: FP16 with KV cache may be the "sweet spot" between speed and accuracy;
  3. Determinism challenge: the KV cache's impact on strictly deterministic outputs deserves attention.

Section 08

Conclusion: Balancing Efficiency Optimization and Numerical Behavior

This study dispels the illusion of numerical equivalence for the KV cache and traces the divergence to FP16 floating-point non-associativity. It reminds us that efficiency optimizations must be weighed against the underlying numerical behavior they change; only then can we build reliable and interpretable AI systems.