The attention mechanism is the core of the Transformer architecture, allowing models to dynamically focus on different parts of the input sequence.
Mathematical Essence of Self-Attention
Self-attention proceeds in three steps: linearly project the input into Query, Key, and Value matrices; compute similarity scores between queries and keys, scaled by the square root of the key dimension; take a weighted sum of the values, with weights given by a softmax over the scores.
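The three steps can be sketched directly in NumPy; this is a minimal single-head version with hypothetical weight matrices `Wq`, `Wk`, `Wv` passed in by the caller, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Step 1: linear projections to Query, Key, Value.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Step 2: similarity scores, scaled by sqrt(d_k) to keep gradients stable.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Step 3: softmax weights, then a weighted sum of the values.
    weights = softmax(scores, axis=-1)
    return weights @ V
```

Each row of `weights` sums to 1, so every output position is a convex combination of the value vectors.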
Multi-Head Attention
The representation is split into multiple "heads"; each head learns a different attention pattern, letting the model capture distinct linguistic phenomena such as syntax and semantics in parallel.
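A toy sketch of the head split and merge, assuming identity projections for brevity (real implementations learn separate Q/K/V projections per head and a final output projection):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads):
    seq, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model divisible by num_heads
    # Split the model dimension into heads: (num_heads, seq, d_head).
    heads = X.reshape(seq, num_heads, d_head).transpose(1, 0, 2)
    outs = []
    for h in heads:
        # Each head attends independently over its own slice.
        scores = h @ h.T / np.sqrt(d_head)
        outs.append(softmax(scores) @ h)
    # Concatenate the heads back to (seq, d_model).
    return np.stack(outs).transpose(1, 0, 2).reshape(seq, d_model)
```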
Positional Encoding
Positional encoding injects sequence-order information. The original Transformer uses fixed sine and cosine functions, while many modern models adopt RoPE (Rotary Position Embedding), which generalizes better to long sequences.
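The original sinusoidal scheme can be written compactly; this sketch follows the standard formulation with sines on even dimensions and cosines on odd ones, assuming an even `d_model`:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # pos: (seq_len, 1), i: (1, d_model // 2) pairs of (sin, cos) channels.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe
```

The encoding is added to the token embeddings before the first attention layer.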
Causal Masking and Autoregressive Generation
In generation tasks, a causal mask blocks attention to future positions, ensuring that the prediction of the nth token depends only on the first n-1 tokens; this is what enables autoregressive generation.
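The mask is typically applied by setting future-position scores to negative infinity before the softmax, so those positions receive zero weight; a minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks future positions to be blocked.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def apply_causal_mask(scores):
    # -inf scores become exactly 0 after the softmax.
    return np.where(causal_mask(scores.shape[-1]), -np.inf, scores)
```

With this mask in place, row i of the attention weights is nonzero only for columns 0..i, so each token can only attend to itself and earlier tokens.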