A developer is implementing a text-generation model. During the decoding process for each new token, their model first computes a final hidden state vector from the decoder. They then immediately apply a Softmax function to this hidden state vector to get a probability distribution for the next token. Which statement best analyzes the flaw in this approach?
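A minimal sketch of the issue the question targets, using hypothetical sizes (768-dimensional hidden state, 50,000-token vocabulary) and PyTorch for illustration: applying Softmax directly to the hidden state yields a distribution over hidden dimensions, not over the vocabulary, so a projection to vocabulary logits is needed first.

```python
import torch

hidden_size, vocab_size = 768, 50000     # hypothetical sizes
h = torch.randn(hidden_size)             # final decoder hidden state

# Flawed approach: Softmax directly over the hidden state gives a
# distribution over 768 hidden dimensions -- meaningless as token
# probabilities, since vocabulary entries were never scored.
wrong = torch.softmax(h, dim=-1)         # shape (768,)

# Standard approach: project to vocabulary logits, then apply Softmax.
W_out = torch.randn(vocab_size, hidden_size)   # output projection (LM head)
logits = W_out @ h                       # shape (50000,), one score per token
probs = torch.softmax(logits, dim=-1)    # valid distribution over the vocabulary
```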
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Related
A Transformer-based language model has a final hidden state dimension of 768 for each token position. The model's vocabulary consists of 50,000 unique tokens. To compute the probability distribution for the next token, the final hidden state vector is multiplied by an output projection matrix before the Softmax function is applied. What must be the dimensions of this output projection matrix?
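A quick shape check for this question, as a sketch in NumPy. It assumes the row-vector convention (hidden state on the left of the multiplication); with a column-vector convention the two dimensions would simply be transposed.

```python
import numpy as np

h = np.random.randn(1, 768)        # final hidden state as a row vector
W = np.random.randn(768, 50000)    # output projection: hidden_size x vocab_size
logits = h @ W                     # (1, 768) @ (768, 50000) -> (1, 50000)
assert logits.shape == (1, 50000)  # one logit per vocabulary token
```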
A Transformer decoder is generating the next token in a sequence. Arrange the following computational steps in the correct order to produce the final probability distribution over the vocabulary.
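A sketch of the pipeline this ordering question describes, with hypothetical module names (`decoder`, `lm_head`) standing in for the model's actual components:

```python
import torch
import torch.nn.functional as F

def next_token_distribution(decoder, lm_head, input_ids):
    """Decoding order, step by step:
    1. run the decoder over the tokens generated so far,
    2. take the final hidden state at the last position,
    3. multiply by the output projection to get vocabulary logits,
    4. apply Softmax to turn logits into a probability distribution.
    """
    hidden_states = decoder(input_ids)   # (seq_len, hidden_size)
    last_hidden = hidden_states[-1]      # (hidden_size,)
    logits = lm_head(last_hidden)        # (vocab_size,)
    return F.softmax(logits, dim=-1)     # probabilities over the vocabulary

# Example wiring with stand-in components (shapes only):
decoder = lambda ids: torch.randn(len(ids), 768)
lm_head = torch.nn.Linear(768, 50000)
probs = next_token_distribution(decoder, lm_head, [1, 2, 3])
assert probs.shape == (50000,)
```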