From Hidden State to Probability Distribution
A decoder-only language model has just computed the final hidden state vector for the last token in an input sequence. Describe, in detail, the sequence of two critical operations that are applied to this vector to generate a probability distribution over the entire vocabulary for predicting the next token. For each operation, identify any associated parameter matrices and explain their purpose and dimensional relationship to the model's hidden size and vocabulary size.
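The two operations the question asks about can be sketched numerically. This is a minimal NumPy illustration with toy dimensions and made-up names (`W_out`, `h`); the exact parameter name and whether a bias term is present vary by model family.

```python
import numpy as np

hidden_size, vocab_size = 8, 20  # toy dimensions for illustration
rng = np.random.default_rng(0)

h = rng.standard_normal(hidden_size)                    # final hidden state of the last token
W_out = rng.standard_normal((hidden_size, vocab_size))  # output (unembedding) projection

# Operation 1: linear projection from hidden space to vocabulary space,
# producing one unnormalized score (logit) per vocabulary token.
logits = h @ W_out          # shape: (vocab_size,)

# Operation 2: softmax, turning logits into a normalized probability
# distribution over the vocabulary (subtracting the max for stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

assert probs.shape == (vocab_size,)
assert np.isclose(probs.sum(), 1.0)
```

Note the dimensional relationship: the projection matrix maps a `hidden_size`-dimensional vector to `vocab_size` logits, so it has shape `(hidden_size, vocab_size)` (or its transpose, depending on convention).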
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing a Language Model's Output Layer
A decoder-only language model has an internal hidden dimension of 768 and a vocabulary of 30,000 unique tokens. After processing an input sequence, the model's final layer of hidden states is multiplied by a weight matrix to produce logits, which are then passed to a final activation function. What must be the dimensions of this weight matrix and what is its primary role in this process?
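The dimensional constraint in this related question can be checked directly. A minimal sketch, assuming the common `hidden @ W` layout (some frameworks store the transpose):

```python
import numpy as np

hidden_size, vocab_size = 768, 30_000

h = np.zeros((1, hidden_size))              # one token's final hidden state
W = np.zeros((hidden_size, vocab_size))     # projection matrix: 768 x 30,000

# Its role: map the hidden representation to one logit per vocabulary token.
logits = h @ W
assert logits.shape == (1, vocab_size)
```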