From Hidden State to Probability Distribution
A decoder-only language model has just computed the final hidden state vector for the last token in an input sequence. Describe, in detail, the sequence of two critical operations that are applied to this vector to generate a probability distribution over the entire vocabulary for predicting the next token. For each operation, identify any associated parameter matrices and explain their purpose and dimensional relationship to the model's hidden size and vocabulary size.
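The two operations the question asks about can be sketched numerically. This is a minimal NumPy illustration with toy dimensions and made-up names (`W_out`, `h`); the exact parameter name and whether a bias term is present vary by model family.

```python
import numpy as np

hidden_size, vocab_size = 8, 20  # toy dimensions for illustration
rng = np.random.default_rng(0)

h = rng.standard_normal(hidden_size)                    # final hidden state of the last token
W_out = rng.standard_normal((hidden_size, vocab_size))  # output (unembedding) projection

# Operation 1: linear projection from hidden space to vocabulary space,
# producing one unnormalized score (logit) per vocabulary token.
logits = h @ W_out          # shape: (vocab_size,)

# Operation 2: softmax, turning logits into a normalized probability
# distribution over the vocabulary (subtracting the max for stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

assert probs.shape == (vocab_size,)
assert np.isclose(probs.sum(), 1.0)
```

Note the dimensional relationship: the projection matrix maps a `hidden_size`-dimensional vector to `vocab_size` logits, so it has shape `(hidden_size, vocab_size)` (or its transpose, depending on convention).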
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing a Language Model's Output Layer
A decoder-only language model has an internal hidden dimension of 768 and a vocabulary of 30,000 unique tokens. After processing an input sequence, the model's final layer of hidden states is multiplied by a weight matrix to produce logits, which are then passed to a final activation function. What must be the dimensions of this weight matrix and what is its primary role in this process?
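The dimensional constraint in this related question can be checked directly. A minimal sketch, assuming the common `hidden @ W` layout (some frameworks store the transpose):

```python
import numpy as np

hidden_size, vocab_size = 768, 30_000

h = np.zeros((1, hidden_size))              # one token's final hidden state
W = np.zeros((hidden_size, vocab_size))     # projection matrix: 768 x 30,000

# Its role: map the hidden representation to one logit per vocabulary token.
logits = h @ W
assert logits.shape == (1, vocab_size)
```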