Case Study

Debugging a Parameterized Softmax Layer

A developer is building a language model that outputs a probability distribution over a vocabulary of 50,000 unique words. The final layer of the model receives a matrix of hidden states, H, with dimensions [16 x 768] (16 tokens, each represented by a 768-dimensional vector). The developer uses a parameterized Softmax layer, defined by the operation Softmax(H ⋅ W), to produce the final output. However, during testing, they encounter a dimension mismatch error that traces back to the H ⋅ W multiplication: the weight matrix W was incorrectly initialized with dimensions [768 x 16]. Based on this scenario, identify the error in the dimensions of W and state what its correct dimensions should be to produce the desired probability distributions. Explain your reasoning.
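One way to start the analysis is to track the shapes by hand. The sketch below is only an illustration under stated assumptions: it uses plain Python rather than any particular deep learning framework, and the helper names (matmul_shape, softmax, covers_vocabulary) are hypothetical, not part of the scenario. It reproduces the shape that H ⋅ W takes with the W given above and checks whether each row of the result could serve as a distribution over the vocabulary.

```python
import math

VOCAB_SIZE = 50_000   # vocabulary size from the scenario
NUM_TOKENS = 16       # rows of H
HIDDEN_DIM = 768      # columns of H


def matmul_shape(a_shape, b_shape):
    """Return the shape of A @ B, or raise if the inner dimensions differ."""
    (a_rows, a_cols), (b_rows, b_cols) = a_shape, b_shape
    if a_cols != b_rows:
        raise ValueError(f"inner dimensions differ: {a_cols} vs {b_rows}")
    return (a_rows, b_cols)


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def covers_vocabulary(w_shape):
    """Check whether Softmax(H @ W) gives one probability per vocabulary word."""
    out_rows, out_cols = matmul_shape((NUM_TOKENS, HIDDEN_DIM), w_shape)
    return out_rows == NUM_TOKENS and out_cols == VOCAB_SIZE


# Shape of H @ W using the W from the scenario.
logits_shape = matmul_shape((NUM_TOKENS, HIDDEN_DIM), (768, 16))
print(logits_shape)

# Softmax normalizes over however many columns a row has,
# so the row width decides how many outcomes the distribution covers.
probs = softmax([0.0] * logits_shape[1])
print(len(probs), sum(probs))

# Does each output row span the full vocabulary?
print(covers_vocabulary((768, 16)))
```

Comparing the printed row width with the vocabulary size is a useful first step when explaining which dimension of W is wrong.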


Updated 2025-10-03


Tags

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science