Section 01
Runtime-Authenticated Bounded-Error Quantized Attention Mechanism (Introduction)
This paper proposes a hierarchical KV cache architecture for runtime-authenticated attention computation. INT8/INT4 quantized data resides in GPU memory to reduce the memory footprint, while the original FP16 values are retained in system memory for deterministic fallback. A two-term error decomposition provides per-head, per-step upper bounds on the error. The method matches FP16 quality at 128K context and recovers from the catastrophic failures caused by naive quantization.
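The tiered layout and bound-gated fallback can be illustrated with a minimal sketch. Everything named here (`TieredKVCache`, `quantize_int8`, the `tolerance` parameter) is hypothetical and not taken from the paper. The bound used is a plain rounding bound for symmetric INT8 quantization, a stand-in for the paper's two-term decomposition, which this section does not spell out. Python floats stand in for FP16 values, and two lists stand in for GPU and system memory.

```python
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-vector INT8 quantization; returns (codes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


class TieredKVCache:
    """Hypothetical two-tier key store (assumption, not the paper's design).

    device_keys: quantized codes, standing in for the GPU-resident tier.
    host_keys:   full-precision originals, standing in for the FP16 copy
                 kept in system memory.
    """

    def __init__(self) -> None:
        self.device_keys: List[Tuple[List[int], float]] = []
        self.host_keys: List[List[float]] = []

    def append(self, key: List[float]) -> None:
        self.device_keys.append(quantize_int8(key))
        self.host_keys.append(list(key))

    def score(self, query: List[float], index: int,
              tolerance: float) -> Tuple[float, float, bool]:
        """Return (score, error_bound, fell_back) for one query-key pair."""
        codes, scale = self.device_keys[index]
        # Each element is off by at most scale / 2 after rounding, so
        # |q . (k - k_hat)| <= ||q||_1 * scale / 2.
        bound = sum(abs(q) for q in query) * scale / 2.0
        if bound <= tolerance:
            return scale * dot(query, codes), bound, False
        # Bound too loose: recompute from the full-precision copy.
        return dot(query, self.host_keys[index]), 0.0, True


cache = TieredKVCache()
cache.append([1.0, -0.5, 0.75, 2.0])
query = [0.5, -1.0, 2.0, 0.25]
print(cache.score(query, 0, tolerance=0.05))   # quantized path
print(cache.score(query, 0, tolerance=0.001))  # falls back to the original
```

The check runs before the score is used, so the fallback decision depends only on the query and the stored scale, not on the unknown true score. A real implementation would evaluate such a bound per head and per step, as the section describes, and would transfer the original values from system memory only for the entries that fail the check.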