Essay

From Hidden State to Probability Distribution

A decoder-only language model has just computed the final hidden state vector for the last token in an input sequence. Describe, in detail, the sequence of two critical operations that are applied to this vector to generate a probability distribution over the entire vocabulary for predicting the next token. For each operation, identify any associated parameter matrices and explain their purpose and dimensional relationship to the model's hidden size and vocabulary size.
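A minimal sketch of the two operations the question asks about, using numpy with hypothetical toy sizes (hidden size 8, vocabulary of 5 tokens; real models use sizes on the order of thousands and tens of thousands, respectively):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5  # toy sizes chosen for illustration

# Final hidden state for the last token, as produced by the transformer stack.
h = rng.standard_normal(d_model)

# Operation 1: linear projection through the output embedding / "LM head"
# matrix, which maps hidden size to vocabulary size: shape (d_model, vocab_size).
W_out = rng.standard_normal((d_model, vocab_size))
logits = h @ W_out  # shape (vocab_size,): one unnormalized score per token

# Operation 2: softmax normalizes the logits into a probability distribution.
# Subtracting the max logit first is the standard numerical-stability trick.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

The resulting `probs` vector is non-negative and sums to 1, so it can be used directly for greedy decoding or sampling of the next token.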

Updated 2025-10-08

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science