Enforcing Autoregressive Behavior
A language model is designed to generate text sequentially, predicting the next token based only on the tokens that came before it. Explain the specific mechanism within the attention calculation that enforces this rule and describe the resulting structure of the attention weight matrix.
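The mechanism in question is the causal (look-ahead) mask applied to the attention scores before the softmax. A minimal NumPy sketch, purely illustrative and not tied to any particular framework, showing how masking future positions with negative infinity yields a lower-triangular attention weight matrix:

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then softmax row-wise.

    scores: (T, T) array where scores[i, j] is the raw score of query
    position i attending to key position j.
    """
    T = scores.shape[0]
    # Boolean mask of strictly-upper-triangular entries: j > i means
    # position j lies in the future relative to query position i.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    # Setting future scores to -inf makes their softmax weight exactly 0.
    masked = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = causal_attention_weights(np.random.randn(4, 4))
# Every entry above the diagonal is zero: the matrix is lower-triangular,
# and each row still sums to 1.
```

Because each query can only place weight on itself and earlier positions, the resulting weight matrix is lower-triangular with each row forming a valid probability distribution.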
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is training an autoregressive language model designed to generate text one word at a time. Due to a configuration error, the attention mechanism is allowed to attend to all tokens in the input sequence, including those that appear later in the sequence, rather than only the preceding ones. The model trains successfully to a very low loss on its training data. What is the most likely outcome when this trained model is later used to generate new text from a prompt?
Debugging an Autoregressive Model's Attention