1Cademy - A decoder-only language model has an internal hidden dimension of 768 and a vocabulary of 30,000 unique tokens. After processing an input sequence, the model's final layer of hidden states is multiplied by a weight matrix to produce logits, which are then passed to a final activation function. What must be the dimensions of this weight matrix, and what is its primary role in this process?

Learn Before

Output Probability Calculation in Transformer Language Models

Multiple Choice

A decoder-only language model has an internal hidden dimension of 768 and a vocabulary of 30,000 unique tokens. After processing an input sequence, the model's final layer of hidden states is multiplied by a weight matrix to produce logits, which are then passed to a final activation function. What must be the dimensions of this weight matrix and what is its primary role in this process?
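
The shape reasoning behind the question can be checked with a small sketch. It assumes the final activation is a softmax and that hidden states are stored as row vectors, so the projection is written as `hidden_states @ w_out`; the names `hidden_states`, `w_out`, and `SEQ_LEN` are hypothetical and the weights are random stand-ins, not taken from any real model.

```python
import numpy as np

HIDDEN_DIM = 768      # hidden dimension given in the question
VOCAB_SIZE = 30_000   # vocabulary size given in the question
SEQ_LEN = 4           # hypothetical sequence length, chosen only for illustration

rng = np.random.default_rng(0)

# Stand-in for the final-layer hidden states: one 768-dim vector per position.
hidden_states = rng.standard_normal((SEQ_LEN, HIDDEN_DIM), dtype=np.float32)

# Output projection (random placeholder weights): maps each hidden vector
# to one unnormalized score (logit) per vocabulary token.
w_out = rng.standard_normal((HIDDEN_DIM, VOCAB_SIZE), dtype=np.float32) * 0.02

# (SEQ_LEN, 768) @ (768, 30000) -> (SEQ_LEN, 30000)
logits = hidden_states @ w_out


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Assumed final activation: softmax turns logits into a distribution over tokens.
probs = softmax(logits)

print(logits.shape, probs.shape)
```

Under these assumptions the matrix is 768 × 30,000, and each row of `probs` sums to 1 across the vocabulary. Some frameworks store the same projection transposed (for example, PyTorch's `nn.Linear` keeps its weight as `(out_features, in_features)`), so a 30,000 × 768 layout describes the same mapping.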

Updated 2025-10-03
