A decoder-only language model has an internal hidden dimension of 768 and a vocabulary of 30,000 unique tokens. After processing an input sequence, the model's final layer of hidden states is multiplied by a weight matrix to produce logits, which are then passed to a final activation function. What must be the dimensions of this weight matrix and what is its primary role in this process?
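The shapes involved can be checked with a small NumPy sketch. It is illustrative only and rests on several assumptions that the question does not state: hidden states are stored as row vectors (so the projection has shape 768 x 30,000; under a column-vector convention, or in the way PyTorch's `nn.Linear` stores its weight, it would be the transpose, 30,000 x 768), the final activation is a softmax, the sequence length is 4, and the weights are random placeholders. All names (`hidden_states`, `w_out`, `softmax`) are hypothetical.

```python
import numpy as np

HIDDEN_DIM = 768      # hidden size given in the question
VOCAB_SIZE = 30_000   # vocabulary size given in the question
SEQ_LEN = 4           # assumed sequence length, chosen only for illustration

rng = np.random.default_rng(0)

# Final-layer hidden states: one 768-dimensional row vector per position.
hidden_states = rng.standard_normal((SEQ_LEN, HIDDEN_DIM), dtype=np.float32)

# Output projection with random placeholder values (not trained weights).
# Row-vector convention: (hidden_dim, vocab_size).
w_out = rng.standard_normal((HIDDEN_DIM, VOCAB_SIZE), dtype=np.float32) * 0.02

# Each position gets one logit (unnormalized score) per vocabulary token.
logits = hidden_states @ w_out  # shape: (SEQ_LEN, VOCAB_SIZE)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Assumed final activation: softmax turns logits into a distribution
# over the 30,000 tokens at every position.
probs = softmax(logits)

print(logits.shape, probs.shape)
```

Under these assumptions the matrix maps each hidden vector from the model's 768-dimensional internal space into vocabulary space, and the softmax then normalizes each row so that it sums to 1.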
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing a Language Model's Output Layer
From Hidden State to Probability Distribution