Learn Before
An autoregressive model is processing a sequence of 4 tokens. To ensure that the prediction for any given token is based only on the tokens that came before it and the token itself, a specific structure is imposed on the attention weight matrix. Which of the following 4x4 matrices correctly illustrates this structure, where 'α' represents a calculated, non-zero attention weight and '0' represents a weight that has been forcibly set to zero?
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Role of Causal Attention in Autoregressive Language Models
Causal Attention Output for a Single Token
Visualization of Query-Key Dot Products in Causal Attention
An autoregressive model calculates a square attention weight matrix using the formula:
Softmax((QK^T / sqrt(d)) + Mask). The purpose of the Mask component is to prevent any token from attending to subsequent tokens in the sequence. Which statement best describes the resulting attention weight matrix?
Applying a Causal Mask to Attention Scores