Diagnosing a Language Model's Output Layer
Based on the case study, describe the two essential, sequential operations that must be applied to the model's final hidden state matrix to convert it into the desired set of probability distributions over the vocabulary. For each operation, specify its purpose and the resulting shape of the data.
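As a rough illustration of the two operations the question asks about, the sketch below assumes the final hidden state matrix has shape (sequence length, hidden dimension), that it is first projected to vocabulary size by an output weight matrix, and that a row-wise softmax then turns each row of logits into a probability distribution. The toy sizes, the names `hidden`, `w_out`, `project_to_logits`, and `softmax`, and the use of NumPy are all assumptions for illustration; the case study itself is not reproduced here.

```python
import numpy as np

def project_to_logits(hidden, w_out):
    """Operation 1: linear projection from hidden space to vocabulary space.

    (seq_len, d_model) @ (d_model, vocab_size) -> (seq_len, vocab_size)
    """
    return hidden @ w_out

def softmax(logits):
    """Operation 2: normalize each row of logits into a probability distribution.

    The shape is unchanged: (seq_len, vocab_size).
    """
    # Subtracting the row maximum does not change the result but avoids overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy sizes chosen for illustration only (not taken from the case study).
seq_len, d_model, vocab_size = 4, 8, 10
rng = np.random.default_rng(0)
hidden = rng.normal(size=(seq_len, d_model))    # final hidden states
w_out = rng.normal(size=(d_model, vocab_size))  # output projection weights

logits = project_to_logits(hidden, w_out)  # one unnormalized score per vocabulary token
probs = softmax(logits)                    # one distribution per sequence position

print(logits.shape, probs.shape)
print(probs.sum(axis=-1))  # each row should sum to approximately 1
```

The order matters in this sketch: softmax has to run over the vocabulary axis, so the projection must come first.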
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing a Language Model's Output Layer
A decoder-only language model has an internal hidden dimension of 768 and a vocabulary of 30,000 unique tokens. After processing an input sequence, the model's final layer of hidden states is multiplied by a weight matrix to produce logits, which are then passed to a final activation function. What must be the dimensions of this weight matrix and what is its primary role in this process?
From Hidden State to Probability Distribution
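
For the related question above about a hidden dimension of 768 and a vocabulary of 30,000 tokens, a quick shape check can be sketched as follows. It assumes the convention `logits = hidden @ W`, under which the weight matrix has shape (768, 30000); some frameworks store the transpose, (30000, 768), for example PyTorch's `nn.Linear` weight. The sequence length of 5 and all variable names are hypothetical.

```python
import numpy as np

d_model, vocab_size = 768, 30_000  # values stated in the related question
seq_len = 5                        # hypothetical sequence length

rng = np.random.default_rng(0)
hidden = rng.standard_normal((seq_len, d_model), dtype=np.float32)
# Under the convention logits = hidden @ W, W maps hidden space to vocabulary space.
W = rng.standard_normal((d_model, vocab_size), dtype=np.float32)

logits = hidden @ W  # one score per vocabulary token at each position

print(W.shape)       # (768, 30000)
print(W.size)        # 768 * 30000 = 23040000 weights (bias not counted)
print(logits.shape)  # (5, 30000)
```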