Learn Before
Debugging a Parameterized Softmax Layer
A developer is building a language model designed to output probabilities over a vocabulary of 50,000 unique words. The final layer of the model receives a matrix of hidden states, H, with dimensions [16 × 768] (representing 16 tokens, each with a 768-dimensional vector). The developer uses a parameterized Softmax layer, defined by the operation Softmax(H ⋅ W), to obtain the final output. However, during testing, the layer fails to produce the expected vocabulary-sized output: the weight matrix W was incorrectly initialized with dimensions [768 × 16]. Based on the provided scenario, identify the error in the dimensions of the weight matrix W and state what its correct dimensions should be to produce the desired probability distributions. Explain your reasoning.
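For concreteness, here is a minimal NumPy sketch of the shapes involved (the variable names and random initialization are illustrative, not part of the scenario):

```python
import numpy as np

seq_len, hidden_dim, vocab_size = 16, 768, 50_000

# Hidden states from the model's final layer: one 768-dim vector per token.
H = np.random.randn(seq_len, hidden_dim)        # shape (16, 768)

# The scenario's incorrect initialization: W with shape (768, 16).
# H @ W still multiplies, but yields a (16, 16) matrix -- 16 scores per
# token instead of one score per vocabulary word.
W_wrong = np.random.randn(hidden_dim, seq_len)
print((H @ W_wrong).shape)                      # (16, 16)

# Correct shape: W must project each 768-dim hidden vector to 50,000 logits.
W = np.random.randn(hidden_dim, vocab_size)     # shape (768, 50000)
logits = H @ W                                  # shape (16, 50000)

# Row-wise softmax turns each token's logits into a probability distribution.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
print(probs.shape)                              # (16, 50000); each row sums to 1
```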
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Probability Distribution Formula for an Encoder-Softmax Language Model
Output Probability Calculation in Transformer Language Models
Next-Token Probability Calculation in Autoregressive Decoders
A neural network produces a final matrix of hidden state vectors, H, with dimensions [sequence_length × hidden_dimension]. To generate a probability distribution over a vocabulary of size V for each position in the sequence, a parameterized Softmax layer is used, which computes Softmax(H ⋅ W). What is the primary role and required shape of the weight matrix W in this operation?
Debugging a Parameterized Softmax Layer
A parameterized Softmax layer is used to convert a sequence of hidden state vectors into a sequence of probability distributions over a vocabulary. Arrange the following steps of this process into the correct chronological order.