NexusQuant's implementation includes several key steps:
Importance Scoring offers two options: fast scoring based on a Key-Key proxy (no extra computation) or a real attention scorer (higher quality, but it requires an additional forward pass).
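The source does not spell out how the Key-Key proxy is computed, but a plausible sketch is to score each token by how strongly its Key aligns with the other Keys, reusing vectors the model already produced so no extra forward pass is needed. The function name and the cosine-similarity heuristic below are assumptions for illustration, not NexusQuant's actual API.

```python
import numpy as np

def key_key_importance(keys: np.ndarray) -> np.ndarray:
    """Hypothetical Key-Key proxy: score tokens by cumulative key similarity.

    keys: (seq_len, head_dim) Key vectors for one attention head.
    Returns a (seq_len,) importance score, computed only from the cached Keys.
    """
    # Normalize so the dot product measures directional similarity.
    norms = np.linalg.norm(keys, axis=-1, keepdims=True)
    unit = keys / np.clip(norms, 1e-8, None)
    # Heuristic: a Key that aligns with many other Keys likely sits in a
    # direction Queries also probe, so its token tends to attract attention.
    sim = unit @ unit.T              # (seq_len, seq_len) pairwise similarities
    return sim.sum(axis=-1)          # aggregate per-token score

scores = key_key_importance(np.random.default_rng(0).normal(size=(8, 16)))
```

High-scoring tokens could then be kept at full precision while the rest are quantized more aggressively; the real attention scorer would replace `key_key_importance` with actual attention weights from an extra forward pass.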
RoPE Removal is another key trick. Since Rotary Position Embedding (RoPE) rotates Keys into a different subspace at each position, quantizing them directly works poorly. NexusQuant therefore 'undoes' RoPE before quantization, bringing all Keys back into a common position-free subspace, and re-applies RoPE after dequantization.
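Because RoPE is a pure rotation, it is exactly invertible by rotating through the negative angle. The sketch below shows the round trip: de-rotate the Keys, quantize them in the shared subspace, then restore RoPE on the dequantized values. The helper names and the simple symmetric INT8 scheme are illustrative assumptions, not the system's actual quantizer.

```python
import numpy as np

def rope_angles(seq_len: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE angle table: position * inverse frequency per dim pair."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)       # (dim/2,)
    return np.outer(np.arange(seq_len), inv_freq)          # (seq_len, dim/2)

def apply_rope(x: np.ndarray, angles: np.ndarray, sign: float = 1.0) -> np.ndarray:
    """Rotate each (even, odd) dim pair by sign * angle; sign=-1 undoes RoPE."""
    cos, sin = np.cos(sign * angles), np.sin(sign * angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(1)
rotated_keys = apply_rope(rng.normal(size=(4, 8)), rope_angles(4, 8))  # as cached

# Step 1: undo RoPE so all Keys share one subspace.
ang = rope_angles(4, 8)
plain = apply_rope(rotated_keys, ang, sign=-1.0)

# Step 2: quantize in that subspace (toy symmetric INT8 for illustration).
scale = np.abs(plain).max() / 127.0
q = np.round(plain / scale).astype(np.int8)

# Step 3: dequantize and restore RoPE before the Keys are used in attention.
restored = apply_rope(q.astype(np.float64) * scale, ang, sign=1.0)
```

The payoff is that the quantizer sees Keys with a consistent distribution instead of one smeared across position-dependent rotations, which tightens the quantization range.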
Boundary Protection is an optimization for specific model families. Qwen-series models are particularly sensitive to quantization in certain layers, so the system provides a protect_boundary parameter that keeps the first and last several layers in FP16 precision.
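A minimal sketch of how such a parameter might map to a per-layer precision plan, assuming `protect_boundary` counts the number of layers shielded at each end (the helper name and the INT4 default for interior layers are assumptions, not the documented behavior):

```python
def layer_precisions(num_layers: int, protect_boundary: int = 0) -> list[str]:
    """Hypothetical helper: boundary layers stay FP16, interior layers quantize.

    protect_boundary: how many layers at each end of the stack to keep in FP16.
    """
    return [
        "fp16" if i < protect_boundary or i >= num_layers - protect_boundary
        else "int4"
        for i in range(num_layers)
    ]

# e.g. a 32-layer Qwen-style model protecting 2 layers at each end:
plan = layer_precisions(32, protect_boundary=2)
# plan[0:2] and plan[-2:] are "fp16"; the 28 interior layers are "int4".
```

Keeping only the boundary layers in full precision costs little memory while shielding the layers where quantization error hurts most.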