Learn Before
A language model is designed to generate text one token at a time, predicting the next token based only on the ones that came before it. The image below shows four possible heatmaps (A, B, C, D) representing the attention scores between tokens in a 4-token sequence. The token making the query is on the vertical axis, and the token providing the key is on the horizontal axis. A darker square indicates that a query token is paying more attention to a key token. Which heatmap correctly illustrates the attention pattern required for this type of sequential generation model to function correctly?
[Image containing four 4x4 heatmaps labeled A, B, C, and D. A: A lower-triangular matrix, dark on and below the main diagonal. B: A full matrix, all squares are dark. C: An upper-triangular matrix, dark on and above the main diagonal. D: A diagonal matrix, dark only on the main diagonal.]
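To reason about the question, it helps to see how a decoder-style model computes attention over a short sequence. The sketch below is a minimal NumPy illustration, not from the source: all names, sizes, and the random vectors are illustrative assumptions. It builds query/key scores for a 4-token sequence and applies the mask such a sequential generator needs, so each query can only attend to keys at or before its own position.

```python
import numpy as np

# Toy causal self-attention for a 4-token sequence.
# All names and sizes here are illustrative assumptions, not from the source.
np.random.seed(0)
T, d = 4, 8                      # sequence length, head dimension
Q = np.random.randn(T, d)        # one query vector per token (rows)
K = np.random.randn(T, d)        # one key vector per token (rows)

scores = Q @ K.T / np.sqrt(d)    # raw attention logits, shape (T, T)

# Causal mask: a query at position i may only attend to keys j <= i,
# because tokens after position i do not exist yet during generation.
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Softmax over keys; entries masked to -inf receive zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Everything strictly above the diagonal is zero, and each row sums to 1.
print(np.allclose(np.triu(weights, k=1), 0.0))  # True
```

Printing `weights` as a heatmap would show dark squares only on and below the main diagonal, which is the pattern the question asks you to identify among the four options.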
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Next-Token Probability Calculation in Autoregressive Decoders
Enumeration of Dot Products in Causal Self-Attention
Debugging a Generative Language Model
Example of Causal Attention Dot Products
Choosing the Right Attention Mechanism