Research on Activation Value Measurement of Open-Source Large Models: Revealing Hidden Risks in Quantization Deployment

This article summarizes a systematic measurement study of the activation dynamic range in modern open-source large language models. The study finds that maximum activation values can differ by nearly four orders of magnitude across model families, a result with direct consequences for low-bit quantization deployment.

Tags: large language models, quantization deployment, activation values, MoE, INT-8, model inference, open-source models

Published 2026-05-15 11:31 | Recent activity 2026-05-18 11:18 | Estimated read 8 min

Section 01

[Introduction] Core Points of the Research on Activation Value Measurement of Open-Source Large Models

The study systematically measures the dynamic range of activation values in modern open-source large language models. It reports three main results: maximum activation values differ by nearly four orders of magnitude across model families; MoE architectures show significantly lower activation values than Dense models of the same scale; and the residual stream carries the global maximum activation value. These findings bear directly on low-bit quantization deployment and support treating the maximum activation value as a model attribute that should be measured and reported.

Section 02

Research Background and Motivation

In real deployments of large language models, the dynamic range of activation values directly affects low-bit quantization, scaling, and inference stability. Earlier studies were based on the LLaMA series from before 2024 and did not check whether their conclusions hold for newer architectures such as Qwen, Gemma, and Mixtral. Existing quantization toolchains still rely on those early conclusions, which may cause deployment issues. The core research questions are how large the activation values of modern open-source models are, and how they differ across model families, generations, and training stages.

Section 03

Construction of a Unified Measurement Framework

Dataset and Preprocessing

The study uses a multi-domain corpus of 5,000 samples (covering news, encyclopedias, etc.) and applies family-specific tokenization strategies to avoid bias.

Full Coverage of Measurement Positions

Measurement hooks are set at key positions such as embedding layers, hidden states, attention mechanisms, MLP/MoE modules, SwiGLU gates, and normalization layers to fully observe the propagation path of activation values.
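A minimal sketch of how such measurement hooks might look, assuming PyTorch forward hooks. The helper name `attach_absmax_hooks`, the toy three-layer model, and the random inputs are illustrative stand-ins and are not taken from the study's released code.

```python
import torch
import torch.nn as nn

def attach_absmax_hooks(model, stats):
    """Register forward hooks that keep the running max |activation| per submodule."""
    handles = []
    for name, module in model.named_modules():
        if name == "":  # skip the root container itself
            continue

        def hook(mod, inputs, output, name=name):
            # Only plain tensor outputs are handled in this sketch.
            if isinstance(output, torch.Tensor):
                peak = output.detach().abs().max().item()
                stats[name] = max(stats.get(name, 0.0), peak)

        handles.append(module.register_forward_hook(hook))
    return handles

# Toy stand-in model (hypothetical); a real run would load an LLM checkpoint.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.SiLU(), nn.Linear(32, 16))

stats = {}
handles = attach_absmax_hooks(model, stats)
with torch.no_grad():
    for _ in range(4):  # stand-in for batches drawn from the measurement corpus
        model(torch.randn(8, 16))
for handle in handles:
    handle.remove()  # detach hooks once measurement is done

# Position that carried the largest activation magnitude in this run.
global_position = max(stats, key=stats.get)
print(global_position, stats[global_position])
```

In a real transformer, some modules return tuples rather than a single tensor, so the hook would need to unpack the output. Residual-stream values would typically be read at the output of each decoder block, which is an assumption about the model implementation rather than something the article specifies.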

Breadth of Model Coverage

The study covers 27 checkpoints from 8 mainstream open-source model families, including Dense architectures (LLaMA, Qwen, Gemma), MoE architectures (Mixtral, Qwen-MoE), vision-language models, and versions from different training stages.

Section 04

Core Findings: Family Differences and Patterns of Activation Values

Finding 1: Cross-family differences of nearly four orders of magnitude

At similar parameter counts, maximum activation values differ sharply across families: the Qwen3.5 series and the MoE models cluster in the 10² to 10³ range, while Gemma3-27B-it reaches about 7×10⁵. This challenges the intuition that "the larger the model, the larger the activation value range."
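The "nearly four orders of magnitude" figure can be checked with a short calculation. The sketch below takes the lower end of the 10² to 10³ range and the approximate Gemma3-27B-it peak quoted above. Picking the lower end as the reference point is an assumption made for illustration.

```python
import math

low_peak = 1e2   # lower end of the 10^2 to 10^3 range quoted above
high_peak = 7e5  # approximate Gemma3-27B-it peak quoted above

ratio = high_peak / low_peak
orders_of_magnitude = math.log10(ratio)
print(f"ratio = {ratio:.0f}x, about {orders_of_magnitude:.2f} orders of magnitude")
```

The ratio is 7000, or about 3.85 orders of magnitude, consistent with "nearly four".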

Finding 2: Natural advantages of MoE architectures

At the same scale, the maximum activation values of MoE checkpoints are 14.0 to 23.4 times lower than those of Dense models, possibly because the sparse activation of the gating mechanism suppresses large values.

Finding 3: Residual streams carry the global maximum value

In 22 out of 24 checkpoints, residual streams carry the global maximum activation value. The engineering significance is that residual streams determine the boundary of the model's numerical stability.

Section 05

Implications for Low-Bit Quantization Deployment

INT-8 quantization verification shows that the measured maximum activation value and the low-bit reconstruction error covary strongly. Choosing a scaling strategy based on actual measurements can therefore reduce information loss. Recommendations:

Model publishers should clearly report the maximum activation value in the model card;
Different model families need differentiated quantization configurations to avoid precision degradation caused by a "one-size-fits-all" approach;
Deployment engineers should measure the activation value distribution on representative data before deployment, instead of relying on empirical values.
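To illustrate why the maximum activation value matters for INT-8, here is a minimal sketch assuming symmetric per-tensor absmax quantization (scale = max|x| / 127). That scheme is an assumption chosen for illustration and is not necessarily the one used in the study's verification. The data is synthetic, and the 7×10⁵ outlier simply reuses the peak value quoted earlier.

```python
import random

def quantize_int8_absmax(values):
    """Symmetric per-tensor INT8 quantization with scale = max|x| / 127."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map INT8 codes back to floating-point values."""
    return [q * scale for q in quantized]

def mean_abs_error(values):
    """Mean absolute reconstruction error after a quantize/dequantize round trip."""
    quantized, scale = quantize_int8_absmax(values)
    restored = dequantize(quantized, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

random.seed(0)
# Synthetic activations: unit-variance Gaussian noise (illustrative only).
base = [random.gauss(0.0, 1.0) for _ in range(1000)]

err_no_outlier = mean_abs_error(base)
err_with_outlier = mean_abs_error(base + [7e5])  # one large outlier sets the scale

print(f"error without outlier: {err_no_outlier:.4f}")
print(f"error with outlier:    {err_with_outlier:.4f}")
```

With a single large outlier, the quantization step becomes so coarse that ordinary values collapse to zero, so the reconstruction error rises sharply. This is the kind of effect that measuring activation peaks before choosing a scaling strategy is meant to catch.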

Section 06

Research Limitations and Future Directions

Limitations

The study relies on a static 5,000-sample corpus, so it may not capture activation behavior in specific domains or under extreme inputs;
It focuses mainly on maximum values and does not analyze the full distribution of activation values in depth (such as long-tail characteristics and outlier frequency).

Future Directions

Extend the measurement framework to models with 100B+ parameters;
Study the causal relationship between the dynamic range of activation values and factors such as training data and optimizer selection;
Develop adaptive quantization algorithms based on activation value characteristics.

Section 07

Conclusion

Through systematic measurement, this study reveals large differences in the activation dynamic range of modern open-source large language models and provides an empirical basis for low-bit quantization deployment. Its core conclusion is that the maximum activation value is a model attribute that should be measured and reported, not a minor detail. Model developers and deployment engineers should include activation value analysis in their standard processes. The research code has been open-sourced to help the community understand the numerical characteristics of models and balance efficiency and precision.
