# Phase Transition in Attention: A Bayesian Theory of Copy Head Emergence

> This study proposes a Bayesian theory for attention feature learning. By analyzing the training of a single-layer softmax attention network on the copy task, it finds that softmax attention exhibits a first-order phase transition, while linear attention undergoes a second-order phase transition followed by smooth evolution, providing a first-principles explanation for the sudden emergence of copy circuits in Transformers.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-10T13:26:56.000Z
- Last activity: 2026-06-11T01:23:34.596Z
- Popularity: 128.1
- Keywords: attention mechanism, phase transition, Bayesian theory, copy head, induction head, transformer, in-context learning
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2606-12058v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2606-12058v1

---

## [Introduction] Bayesian Theory of Attention Phase Transition: A First-Principles Explanation for Copy Head Emergence

This thread summarizes the paper 'Phase Transition in Attention: A Bayesian Theory of Copy Head Emergence', posted to arXiv on June 10, 2026 (original link: http://arxiv.org/abs/2606.12058v1). The authors apply a Bayesian feature learning theory to a single-layer softmax attention network trained on the copy task. They find that softmax attention exhibits a first-order phase transition (an abrupt change in the attention pattern), while linear attention undergoes a second-order phase transition followed by smooth evolution. This gives a first-principles explanation for the sudden emergence of copy circuits in Transformers.

## Research Background: Attention Emergence Phenomenon and the Importance of Copy Heads

The attention mechanism in the Transformer architecture is central to in-context learning. During training, attention patterns are observed to emerge suddenly rather than evolve gradually, but this behavior has lacked a theoretical explanation. The copy sub-circuit is a key component of the Transformer's induction head: it identifies and copies patterns from the input sequence, and it underpins in-context learning ability. Understanding how this circuit forms is therefore crucial for understanding how Transformers learn.

## Theoretical Framework and Research Methods

The research team proposes a Bayesian feature learning theory that treats the learning of attention weights as a Bayesian inference problem. The setup is a single-layer softmax attention network trained on the copy task. By deriving a closed-form posterior distribution over the attention matrix, the authors reduce the problem to a low-dimensional space of order parameters, which simplifies the model while retaining its core features.
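To make the setup concrete, here is a minimal NumPy sketch of a single attention layer on a toy copy task. It is not the paper's model: the task definition (each position copies the token at a fixed source position), the scalar `strength` that scales a one-hot copy pattern in the score matrix, and all function names are assumptions made for illustration. The scalar only loosely mimics the idea of summarizing the attention matrix by a low-dimensional order parameter.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_output(scores, values, kind="softmax"):
    """Mix value vectors with attention weights derived from `scores`."""
    if kind == "softmax":
        weights = softmax(scores, axis=-1)
    else:
        # Linear attention (assumed form): raw scores act as the weights.
        weights = scores
    return weights @ values

def copy_error(strength, kind="softmax", seq_len=6, dim=4, seed=0):
    """Mean squared error of the layer on a toy copy task (hypothetical)."""
    rng = np.random.default_rng(seed)
    tokens = rng.normal(size=(seq_len, dim))
    # Assumed task: position i must output the token at position source[i].
    source = np.roll(np.arange(seq_len), 1)
    pattern = np.zeros((seq_len, seq_len))
    pattern[np.arange(seq_len), source] = 1.0
    # `strength` plays the role of a scalar order parameter.
    out = attention_output(strength * pattern, tokens, kind)
    return float(np.mean((out - tokens[source]) ** 2))

for s in (0.0, 2.0, 8.0):
    print(f"strength={s:>4}: softmax copy error = {copy_error(s):.4f}")
```

At `strength = 0` the softmax weights are uniform and the layer outputs the mean token, which corresponds to the disordered regime. As `strength` grows, the weights concentrate on the source position and the copy error falls toward zero.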

## Core Findings: Phase Transition Phenomena and Comparison of Two Attention Mechanisms

As the amount of training data increases, the system undergoes a phase transition: below the transition point attention is disordered, and above it the copy circuit forms. Experiments with both Bayesian sampling and Adam training support this conclusion. The two attention mechanisms behave differently. Softmax attention exhibits a first-order phase transition, in which the attention pattern changes abruptly, much like water freezing. Linear attention first undergoes a second-order phase transition, a continuous change similar to a ferromagnet at its Curie temperature, and then evolves smoothly. The softmax nonlinearity is what makes the transition discontinuous, which explains why attention patterns appear suddenly.
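The difference between the two transition types can be illustrated with textbook Landau free energies. The sketch below is not the paper's posterior or free energy. The polynomial forms, the control parameter `a` (lowering it stands in for adding training data), and the function names are all assumptions chosen only to show a continuous versus a discontinuous change in an order parameter `m`.

```python
import numpy as np

# Candidate values of the order parameter m (m = 0 means "disordered").
m_grid = np.linspace(0.0, 2.0, 4001)

def f_second_order(m, a):
    # F = a m^2 + m^4: the minimum leaves m = 0 continuously once a < 0.
    return a * m**2 + m**4

def f_first_order(m, a):
    # F = a m^2 - m^4 + m^6: a second minimum at m > 0 competes with m = 0,
    # so the global minimum jumps when the two become degenerate.
    return a * m**2 - m**4 + m**6

def order_parameter(free_energy, controls):
    """Global minimizer of the free energy over m_grid for each control value."""
    return np.array([m_grid[np.argmin(free_energy(m_grid, a))] for a in controls])

# Sweep the control parameter downward, as a stand-in for more data.
controls = np.linspace(0.5, -0.5, 201)
m_second = order_parameter(f_second_order, controls)
m_first = order_parameter(f_first_order, controls)

# Largest change in m between neighbouring control values.
jump_second = float(np.max(np.abs(np.diff(m_second))))
jump_first = float(np.max(np.abs(np.diff(m_first))))
print(f"largest step in m, second-order toy: {jump_second:.3f}")
print(f"largest step in m, first-order toy:  {jump_first:.3f}")
```

In the second-order toy the order parameter grows from zero in small steps that shrink as the sweep is refined. In the first-order toy it jumps by a finite amount at a single control value, no matter how fine the sweep is.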

## Connection to Large Language Models and Theoretical Contributions

Implications for large models: emergent abilities may be related to phase transitions in attention heads, which implies a critical threshold in data volume, and the theoretical framework may make it possible to predict when an ability will emerge. Theoretical contributions: the work provides a first-principles framework, its low-dimensional reduction technique makes complex learning dynamics tractable, and it borrows phase transition theory from statistical physics to explain neural network behavior.

## Limitations and Future Research Directions

Current limitations: the single-layer network differs from real multi-layer Transformers, only the copy task is analyzed, and some derivations rely on simplifying assumptions. Future directions: extend the theory to multi-layer architectures, analyze more in-context learning tasks, use phase transition theory to guide training strategies (e.g., data scheduling), explore phase transition behavior in other components, and study critical phenomena near the transition point (e.g., scaling laws).