Learn Before
A language model is designed to generate text one token at a time, predicting the next token based only on the ones that came before it. The image below shows four possible heatmaps (A, B, C, D) representing the attention scores between tokens in a 4-token sequence. The token making the query is on the vertical axis, and the token providing the key is on the horizontal axis. A darker square indicates that a query token is paying more attention to a key token. Which heatmap correctly illustrates the attention pattern required for this type of sequential generation model to function correctly?
[Image containing four 4x4 heatmaps labeled A, B, C, and D. A: A lower-triangular matrix, dark on and below the main diagonal. B: A full matrix, all squares are dark. C: An upper-triangular matrix, dark on and above the main diagonal. D: A diagonal matrix, dark only on the main diagonal.]
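To reason about the question, it helps to see how a decoder-style model computes attention over a short sequence. The sketch below is a minimal NumPy illustration, not from the source: all names, sizes, and the random vectors are illustrative assumptions. It builds query/key scores for a 4-token sequence and applies the mask such a sequential generator needs, so each query can only attend to keys at or before its own position.

```python
import numpy as np

# Toy causal self-attention for a 4-token sequence.
# All names and sizes here are illustrative assumptions, not from the source.
np.random.seed(0)
T, d = 4, 8                      # sequence length, head dimension
Q = np.random.randn(T, d)        # one query vector per token (rows)
K = np.random.randn(T, d)        # one key vector per token (rows)

scores = Q @ K.T / np.sqrt(d)    # raw attention logits, shape (T, T)

# Causal mask: a query at position i may only attend to keys j <= i,
# because tokens after position i do not exist yet during generation.
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Softmax over keys; entries masked to -inf receive zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Everything strictly above the diagonal is zero, and each row sums to 1.
print(np.allclose(np.triu(weights, k=1), 0.0))  # True
```

Printing `weights` as a heatmap would show dark squares only on and below the main diagonal, which is the pattern the question asks you to identify among the four options.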
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Next-Token Probability Calculation in Autoregressive Decoders
Enumeration of Dot Products in Causal Self-Attention
Debugging a Generative Language Model
Example of Causal Attention Dot Products
Choosing the Right Attention Mechanism