
Comparison of Self-Attention Masking Results

A comparison of the self-attention masking results across causal language modeling, masked language modeling, and permuted language modeling can be visualized using matrices. In these representations, a blue cell at coordinates $(i, j)$ signifies valid attention, indicating that the token at position $j$ attends to the token at position $i$. Conversely, a gray cell denotes blocked attention, meaning the token at position $j$ does not attend to the token at position $i$. Additionally, $\mathbf{e}_{\mathrm{mask}}$ represents the embedding of the symbol $[\mathrm{MASK}]$, which combines both the token embedding and the positional embedding.
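The three masking patterns described above can be sketched as Boolean matrices. This is a minimal illustration, not taken from the source: it assumes the convention that entry `[j, i]` is `True` when the token at position `j` attends to the token at position `i`, and it represents the permuted-LM factorization order as a hypothetical `order` list.

```python
import numpy as np

def causal_mask(n):
    # Causal LM: token j attends only to positions i <= j (lower-triangular).
    return np.tril(np.ones((n, n), dtype=bool))

def mlm_mask(n):
    # Masked LM: fully bidirectional, every token attends to every token.
    return np.ones((n, n), dtype=bool)

def permuted_mask(order):
    # Permuted LM: given a factorization order (a permutation of positions),
    # each token attends to itself and to the tokens that precede it in
    # that permutation, regardless of their original positions.
    n = len(order)
    mask = np.zeros((n, n), dtype=bool)
    for rank, j in enumerate(order):
        for i in order[: rank + 1]:
            mask[j, i] = True
    return mask
```

For example, with `order = [2, 0, 1]`, position 2 attends only to itself, position 0 attends to positions 2 and 0, and position 1 attends to all three, mirroring how permuted language modeling reorders the autoregressive dependency structure.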

[Figure: matrices comparing self-attention masking for causal, masked, and permuted language modeling]

Updated 2026-04-16
