Multiple Choice

An autoregressive model computes a square attention weight matrix via Softmax((QK^T / sqrt(d)) + Mask). The Mask term prevents each token from attending to subsequent tokens in the sequence. Which statement best describes the resulting attention weight matrix?
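To reason about the question, it can help to compute the masked softmax directly. Below is a minimal NumPy sketch of the formula as stated; the function name `causal_attention_weights` and the small random Q/K matrices are illustrative assumptions, not part of the question.

```python
import numpy as np

def causal_attention_weights(Q, K):
    # Scaled dot-product scores: QK^T / sqrt(d)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Additive causal mask: -inf above the diagonal,
    # so softmax assigns zero weight to future tokens
    n = scores.shape[0]
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    masked = scores + mask
    # Row-wise softmax (numerically stabilized)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
W = causal_attention_weights(Q, K)
print(W.round(3))
print("rows sum to 1:", np.allclose(W.sum(axis=-1), 1.0))
```

Printing `W` shows directly which entries the mask forces to zero, which is exactly what the answer choices ask you to characterize.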


Updated 2025-09-29


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science