Detailed Explanation of the Visualization Process
1. Tokenization Stage
The tokenizer uses Byte Pair Encoding (BPE) to split text into subword units. The visualization shows the original text alongside its tokenization, the vocabulary ID of each token, any special tokens, and highlighted token boundaries.
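As a rough illustration of what this stage displays, the sketch below applies a small, ordered list of BPE merge rules to each word and looks up vocabulary IDs. The merge rules, the vocabulary, and the `<bos>` special token are invented for this example; a real tokenizer learns its merges from a corpus and typically works on bytes rather than characters.

```python
def apply_bpe(word, merges):
    """Split `word` into characters, then apply merge rules in priority order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse an adjacent (left, right) pair into a single symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules and vocabulary, for illustration only.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
VOCAB = {"<bos>": 0, "low": 1, "er": 2, "e": 3, "s": 4, "t": 5}


def encode(text):
    """Return (tokens, ids) for whitespace-separated text, prefixed by <bos>."""
    tokens = ["<bos>"]
    for word in text.split():
        tokens.extend(apply_bpe(word, MERGES))
    ids = [VOCAB[t] for t in tokens]
    return tokens, ids


tokens, ids = encode("lower lowest")
print(" | ".join(tokens))  # token boundaries marked with "|"
print(ids)
```

Printing the tokens joined by a separator next to their IDs gives a plain-text version of the boundary and ID view described above.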
2. Embedding Layer
Converts the token sequence into high-dimensional vectors using two components: token embeddings, which map each discrete token ID to a continuous vector, and positional encodings, which add information about each token's position in the sequence. The input to the first Transformer layer is the element-wise sum of the two.
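The element-wise sum can be sketched as follows. This example assumes sinusoidal positional encodings; the text does not say which scheme is used, and many decoder models learn their position embeddings instead. `D_MODEL`, `VOCAB_SIZE`, and the randomly initialized table are hypothetical stand-ins for trained parameters.

```python
import math
import random

D_MODEL = 8      # hypothetical embedding width
VOCAB_SIZE = 10  # hypothetical vocabulary size

random.seed(0)
# Token embedding table: one vector per vocabulary ID (random here, learned in a real model).
token_table = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
               for _ in range(VOCAB_SIZE)]


def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed(token_ids):
    """Return one input vector per token: token embedding + positional encoding."""
    out = []
    for pos, tid in enumerate(token_ids):
        tok = token_table[tid]
        pe = positional_encoding(pos, D_MODEL)
        out.append([t + p for t, p in zip(tok, pe)])
    return out


x = embed([0, 1, 2, 1])
print(len(x), len(x[0]))  # sequence length, embedding width
```

Token ID 1 appears at positions 1 and 3 and receives a different input vector each time, which is the effect the positional term is meant to have.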
3. Transformer Layer
Displays the multi-layer decoder stack. Each layer combines masked self-attention, a feed-forward network, layer normalization, and residual connections. Masked self-attention scores how relevant each earlier position is to the current one, and a causal mask blocks attention to later positions so that no future information leaks into a prediction. Attention weights are rendered as heatmaps that show which context tokens the model attends to.
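A minimal sketch of the masked attention weights behind such a heatmap is shown below. It is single-head, omits the learned query, key, and value projections, and uses made-up hidden states, so it illustrates the masking pattern only, not a full Transformer layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(q[0])
    n = len(q)
    weights, outputs = [], []
    for i in range(n):
        # Position i scores only positions 0..i; later positions get weight 0,
        # which is equivalent to masking their scores with -inf before softmax.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        row = softmax(scores) + [0.0] * (n - i - 1)
        weights.append(row)
        outputs.append([sum(row[j] * v[j][c] for j in range(n))
                        for c in range(len(v[0]))])
    return weights, outputs


# Toy hidden states (hypothetical values) used directly as Q, K, and V.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = causal_attention(h, h, h)
for row in weights:
    print(" ".join(f"{w:.2f}" for w in row))  # text version of the heatmap
```

The printed matrix is lower triangular: every entry above the diagonal is zero, which is what the causal mask looks like in an attention heatmap.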
4. Output Generation
The model outputs a probability distribution over the vocabulary: the LM head maps the final hidden state to logits, Softmax normalizes them into probabilities, and the next token is then selected with a decoding strategy such as Top-k or nucleus (Top-p) sampling.
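The final step can be sketched as below, starting from logits that the LM head would already have produced. The logit values, the `temperature` parameter, and the cutoffs `k=2` and `p=0.9` are illustrative assumptions, not settings taken from this project.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature is an optional extra knob."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token IDs and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(dist, rng):
    """Draw one token ID from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical LM head output
probs = softmax(logits)
rng = random.Random(0)
next_top_k = sample(top_k_filter(probs, k=2), rng)
next_top_p = sample(top_p_filter(probs, p=0.9), rng)
print(next_top_k, next_top_p)
```

Top-k always keeps a fixed number of candidates, while nucleus sampling keeps more candidates when the distribution is flat and fewer when it is peaked.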