Multiple Choice

A neural network produces a final matrix of hidden state vectors, H, with dimensions [sequence_length × hidden_dimension]. To generate a probability distribution over a vocabulary of size V for each position in the sequence, a parameterized Softmax layer is used, which computes Softmax(H ⋅ W). What is the primary role and required shape of the weight matrix W in this operation?
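The intended answer can be checked dimensionally: W must have shape [hidden_dimension × V] so that H ⋅ W yields a [sequence_length × V] logit matrix, which Softmax then normalizes row-wise into per-position vocabulary distributions. A minimal sketch with illustrative sizes (the variable names and dimensions below are assumptions for demonstration, not from the question):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d, V = 4, 8, 10            # illustrative sizes (assumed)
H = np.random.randn(seq_len, d)     # hidden states: [sequence_length x hidden_dimension]
W = np.random.randn(d, V)           # projection:    [hidden_dimension x V]

probs = softmax(H @ W)              # logits [seq_len x V] -> probabilities

assert probs.shape == (seq_len, V)            # one distribution per position
assert np.allclose(probs.sum(axis=-1), 1.0)   # each row sums to 1
```

W thus plays the role of an output (or "unembedding") projection, mapping each hidden state from the model's hidden space into a score for every vocabulary item before normalization.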

Updated 2025-09-26

Tags

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science