Phase Transition in Attention: A Bayesian Theory of Copy Head Emergence

This study proposes a Bayesian theory for attention feature learning. By analyzing the training of a single-layer softmax attention network on the copy task, it finds that softmax attention exhibits a first-order phase transition, while linear attention undergoes a second-order phase transition followed by smooth evolution, providing a first-principles explanation for the sudden emergence of copy circuits in Transformers.

attention mechanism, phase transition, Bayesian theory, copy head, induction head, transformer, in-context learning
Published 2026-06-10 21:26 | Recent activity 2026-06-11 09:23 | Estimated read 6 min

Section 01

[Introduction] Bayesian Theory of Attention Phase Transition: A First-Principles Explanation for Copy Head Emergence

The paper, 'Phase Transition in Attention: A Bayesian Theory of Copy Head Emergence', was posted to arXiv on June 10, 2026 (original link: http://arxiv.org/abs/2606.12058v1). Its core idea: using a Bayesian theory of feature learning to analyze the training of a single-layer softmax attention network on the copy task, the authors find that softmax attention exhibits a first-order phase transition (an abrupt change in the attention pattern), while linear attention undergoes a second-order phase transition followed by smooth evolution. This provides a first-principles explanation for the sudden emergence of copy circuits in Transformers.

Section 02

Research Background: Attention Emergence Phenomenon and the Importance of Copy Heads

The attention mechanism in the Transformer architecture is the core of in-context learning. During training, attention patterns are observed to emerge suddenly rather than evolve gradually, but a theoretical explanation has been lacking. The copy sub-circuit is a key component of the Transformer's induction head: it identifies and copies patterns in the input sequence, and it underpins in-context learning ability. Understanding how it forms is therefore crucial for understanding how Transformers learn.

Section 03

Theoretical Framework and Research Methods

The research team proposes a Bayesian feature learning theory that treats attention weight learning as a Bayesian inference problem. The setup is a single-layer softmax attention network trained on the copy task. By deriving the posterior distribution of the attention matrix in closed form, the authors reduce the problem to a low-dimensional space of order parameters, which simplifies the model while retaining its core features.
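
To make the setup concrete, here is a minimal NumPy sketch of a single attention layer with and without the softmax. It is not the paper's model: the toy copy task (one-hot tokens whose target is the input itself), the diagonal attention matrix `A = m * I`, and the scalar `copy_strength` used as a stand-in order parameter are all assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def softmax_attention(X, A):
    """Single attention layer: softmax over the scores X A X^T, then mix X."""
    return softmax(X @ A @ X.T) @ X

def linear_attention(X, A):
    """Same layer without the softmax; scores are averaged over positions."""
    return (X @ A @ X.T) @ X / X.shape[0]

def copy_strength(A):
    """Hypothetical scalar order parameter: mean diagonal entry of A."""
    return float(np.trace(A)) / A.shape[0]

# Toy copy task (assumed): four distinct one-hot tokens, target = input.
X = np.eye(4)
for m in (0.0, 2.0, 20.0):
    A = m * np.eye(4)
    out = softmax_attention(X, A)
    # The diagonal of `out` shows how much of each token is copied to itself.
    print(m, copy_strength(A), np.round(np.diag(out), 3))
```

In this toy, softmax attention at `m = 0` spreads weight uniformly over all positions, while a large `m` makes each position attend almost entirely to itself and reproduce its own token. The linear variant instead scales its output in proportion to `m`.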

Section 04

Core Findings: Phase Transition Phenomena and Comparison of Two Attention Mechanisms

As the amount of training data increases, the system undergoes a phase transition: below the transition point, attention is disordered; above it, copy circuits form. Experiments with both Bayesian sampling and Adam training consistently support this conclusion. The two attention mechanisms differ in the type of transition. Softmax attention exhibits a first-order phase transition, in which the attention pattern changes abruptly (similar to water freezing). Linear attention first undergoes a second-order phase transition, in which the change is continuous (similar to the transition at the Curie temperature), and then evolves smoothly. The nonlinearity of softmax is what makes the transition discontinuous, which explains the sudden emergence of attention patterns.
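
The difference between the two transition types can be illustrated with textbook Landau free energies from statistical physics. The sketch below is not the paper's derivation: the two polynomial energies, the control parameter `r` (assumed here to decrease as the amount of training data grows), and the grid minimization are all illustrative assumptions. The order parameter `m` is the value that minimizes the energy.

```python
import numpy as np

M_GRID = np.linspace(0.0, 2.0, 20001)  # candidate order-parameter values

def second_order_energy(m, r):
    """Landau form r*m^2 + m^4: the minimizer grows continuously from 0."""
    return r * m**2 + m**4

def first_order_energy(m, r):
    """Landau form r*m^2 - 2*m^4 + m^6: the minimizer jumps at r = 1."""
    return r * m**2 - 2.0 * m**4 + m**6

def order_parameter(energy, r):
    """Value of m on the grid that minimizes energy(m, r)."""
    return float(M_GRID[np.argmin(energy(M_GRID, r))])

def jump(energy, r_critical, eps=0.01):
    """Change in the minimizer from just above to just below r_critical."""
    below = order_parameter(energy, r_critical - eps)
    above = order_parameter(energy, r_critical + eps)
    return below - above

# Critical points of these toy energies: r = 0 (second order), r = 1 (first order).
print("second-order jump:", jump(second_order_energy, 0.0))
print("first-order jump:", jump(first_order_energy, 1.0))
```

In the second-order case the jump shrinks toward zero as `eps` shrinks, because the minimizer leaves zero continuously. In the first-order case it stays close to 1 however small `eps` is, which is the discontinuity that the summary attributes to softmax attention.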

Section 05

Connection to Large Language Models and Theoretical Contributions

Implications for large models: emergent abilities may be related to phase transitions of attention heads, there exists a critical data volume threshold, and the theoretical framework can predict when an ability emerges. Theoretical contributions: the work provides a first-principles framework, introduces a low-dimensional reduction technique that makes complex dynamics analyzable, and borrows phase transition theory from statistical physics to explain neural network behavior.

Section 06

Limitations and Future Research Directions

Current limitations: the single-layer network differs from real multi-layer Transformers, only the copy task is analyzed, and some derivations rely on assumptions. Future directions: extend the theory to multi-layer architectures, analyze more in-context learning tasks, use phase transition theory to guide training strategies (e.g., data scheduling), explore the phase transition behavior of other components, and study critical phenomena at phase transition points (e.g., scaling laws).