Zing Forum


The Numerical Equivalence Illusion of FP16 KV Cache: A Study on Systematic Biases in Autoregressive Inference

This article reveals the numerical non-equivalence between KV-cache inference and cache-free recomputation under FP16 precision. The study finds that, because floating-point operations are non-associative, the two execution paths produce deterministic token-sequence divergences, and the cache-enabled path is more accurate in most of the tested conditions.

KV cache, FP16, numerical equivalence, floating-point arithmetic, non-associativity, transformer inference, autoregressive generation
Published 2026-04-16 23:59 · Recent activity 2026-04-20 10:52 · Estimated read 5 min

Section 01

[Introduction] Core Insights of the Study on the Numerical Equivalence Illusion of FP16 KV Cache

This study challenges the default assumption that KV-cache inference and cache-free recomputation are numerically equivalent in autoregressive Transformer inference. It finds that under FP16 precision the two paths exhibit systematic, deterministic token-sequence divergences, and that the cache-ON path achieves higher accuracy under most test conditions. The root cause is the non-associativity of floating-point operations, a finding with important theoretical and practical implications for model deployment and evaluation.


Section 02

Background: The Role of KV Cache and the Overlooked Assumption

KV cache is a key optimization for Transformer inference: it speeds up long-sequence generation by reusing the key/value vectors computed for previous tokens instead of recomputing them. The field has long assumed that KV-cache inference and cache-free recomputation are numerically equivalent; this study shows empirically that under FP16 the two diverge systematically, breaking that assumption.
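The two execution paths can be sketched in a few lines of NumPy (purely illustrative; the weight names and dimensions are made up, not the paper's code). The cache-ON path appends each new token's key/value vectors to a growing cache, while the cache-free path recomputes K and V for the whole prefix with full matrix multiplications. In FP64 the two agree to rounding; the paper's point is that FP16 breaks this agreement.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
X = rng.standard_normal((T, d))                     # hidden states for T tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # softmax(q K^T / sqrt(d)) V for a single query vector
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum()
    return w @ V

# Cache-ON path: append each token's k/v to a persistent cache
cache_K, cache_V = [], []
for t in range(T):
    cache_K.append(X[t] @ Wk)
    cache_V.append(X[t] @ Wv)
out_cached = attend(X[-1] @ Wq, np.stack(cache_K), np.stack(cache_V))

# Cache-free path: recompute K and V for the whole prefix at once
out_recomputed = attend(X[-1] @ Wq, X @ Wk, X @ Wv)

print(np.allclose(out_cached, out_recomputed))  # True at FP64
```

Both paths compute the same attention in exact arithmetic; the divergence the paper documents appears only once the intermediate results are rounded to FP16.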


Section 03

Methodology: The Mathematical Root of Non-Associativity in Floating-Point Operations

Because precision is finite, floating-point operations do not satisfy the associative law: FP16 (16-bit) rounding errors accumulate differently depending on the order of operations. The KV-cache and cache-free paths order their operations differently (the former builds the attention inputs by concatenating cached vectors, while the latter recomputes them with full matrix multiplications), leading to numerical divergences under FP16.
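The non-associativity is easy to see in isolation. A minimal demonstration (generic NumPy, not from the paper): in FP16, adding a small value before versus after a large cancelling pair changes the result.

```python
import numpy as np

a = np.float16(0.1)
b = np.float16(1e4)    # near FP16's max (65504); the spacing here is 8.0
c = np.float16(-1e4)

left = (a + b) + c     # a is lost when rounded into b, then b and c cancel
right = a + (b + c)    # b and c cancel exactly first, so a survives

print(left, right)     # 0.0 vs 0.1: same operands, different order
```

The same mechanism, repeated across millions of FP16 accumulations with two different summation orders, is what drives the two inference paths apart.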


Section 04

Experimental Evidence: Divergence Rate and Accuracy Differences

Experimental setup: models include LLaMA-2-7B, Mistral-7B-v0.3 (GQA), and Gemma-2-2B; the benchmark is GSM8K; sampling strategies cover both greedy decoding and random sampling. Key findings:
  1. The divergence rate is 100% under all conditions, even with greedy decoding;
  2. The cache-ON path has higher accuracy in 8 of 9 conditions;
  3. Under FP32 the divergence rate drops sharply, with a token flip rate of 0, confirming FP16 non-associativity as the main cause.
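The reported divergence comes from the two paths summing the same products in different orders. A self-contained toy (not the paper's code; the two summation orders here merely stand in for the cache-ON and cache-free paths) counts how often a left-to-right FP16 dot product differs from a pairwise-tree FP16 dot product over random vectors, a crude analogue of the per-token flip rate:

```python
import numpy as np

def dot_fp16_seq(x, y):
    # Left-to-right FP16 accumulation (stand-in for one execution path)
    acc = np.float16(0.0)
    for xi, yi in zip(x, y):
        acc = np.float16(acc + xi * yi)
    return acc

def dot_fp16_tree(x, y):
    # Pairwise-tree FP16 accumulation (stand-in for the other path)
    p = (x * y).astype(np.float16)
    while p.size > 1:
        if p.size % 2:
            p = np.append(p, np.float16(0.0))  # pad odd length
        p = (p[0::2] + p[1::2]).astype(np.float16)
    return p[0]

rng = np.random.default_rng(0)
trials, d = 200, 128
mismatches = 0
for _ in range(trials):
    x = rng.standard_normal(d).astype(np.float16)
    y = rng.standard_normal(d).astype(np.float16)
    if dot_fp16_seq(x, y) != dot_fp16_tree(x, y):
        mismatches += 1

print(f"{mismatches}/{trials} dot products differ between summation orders")
```

Even at a head dimension of 128, a substantial fraction of trials disagree; repeating the same experiment in FP32 or FP64 makes the mismatches far rarer, mirroring the paper's FP32 control.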


Section 05

In-Depth Analysis: Divergence Patterns Across Different Architectures

  • Mistral-7B (GQA architecture): Divergence amplifies sharply in the first layer, as multiple query heads sharing key heads magnify FP16 errors;
  • Gemma-2-2B: Divergence accumulates uniformly across layers, which is related to larger attention head dimensions and sliding window mechanisms.

Section 06

Activation Patching Experiment: Locating Causal Variables

Activation patching of the entire residual stream fails to restore the cache-free generation trajectory, indicating that the causal variable behind the divergence is the stateful KV cache itself, rather than transient errors in the attention computation.
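The distinction between transient computation and persistent state can be illustrated with a toy (purely illustrative; this is not the paper's patching setup): two runs share weights, but one wrote slightly different K/V into its cache at step 1. Patching step 2's hidden state to be identical across runs does not reconcile the outputs, because the divergent values already live in the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wk = rng.standard_normal((d, d)).astype(np.float16)
Wv = rng.standard_normal((d, d)).astype(np.float16)

def step(cache_k, cache_v, h):
    # Commit this token's k/v to the persistent cache, then attend over it
    cache_k.append((h @ Wk).astype(np.float16))
    cache_v.append((h @ Wv).astype(np.float16))
    K, V = np.stack(cache_k), np.stack(cache_v)
    w = np.exp((K @ h).astype(np.float32) / np.sqrt(d))
    w /= w.sum()
    return (w @ V.astype(np.float32)).astype(np.float16)

h1 = rng.standard_normal(d).astype(np.float16)
h1_alt = (h1.astype(np.float32) + 1e-2).astype(np.float16)  # perturbed step-1 state
h2 = rng.standard_normal(d).astype(np.float16)

ck_a, cv_a = [], []
ck_b, cv_b = [], []
step(ck_a, cv_a, h1)      # run A: clean history
step(ck_b, cv_b, h1_alt)  # run B: slightly different history in the cache

# "Patch" step 2: both runs get the identical hidden state h2 ...
out_a = step(ck_a, cv_a, h2)
out_b = step(ck_b, cv_b, h2)

# ... yet outputs still differ, because the caches already diverged
print(np.array_equal(out_a, out_b))
```

Overwriting the transient state (`h2`) is not enough; only rewriting the cache entries themselves would reconcile the two runs, which matches the paper's conclusion that the cache is the causal variable.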


Section 07

Implications for LLM Inference Systems

  1. Re-examine the numerical equivalence assumption: an FP16 KV cache is a lossy optimization;
  2. Precision-efficiency trade-off: FP16 with KV cache may be the "sweet spot" between speed and accuracy;
  3. Determinism challenge: the KV cache's impact on strictly deterministic outputs deserves attention.

Section 08

Conclusion: Balancing Efficiency Optimization and Numerical Behavior

This study dispels the illusion of numerical equivalence for the KV cache and traces the divergence to FP16 floating-point non-associativity. It reminds us that efficiency optimizations must be weighed against the underlying numerical behavior they change; only then can we build reliable and interpretable AI systems.