Learn Before
Transformer Blocks and Post-Norm Architecture
Decoder-Only Transformer as a Language Model
Parameterized Softmax Layer
Model Depth (L) in Transformers
Logits in Transformer Language Models
Output Probability Calculation in Transformer Language Models
In a Transformer-based language model, the probability distribution over the next token is computed by a final Softmax layer. This layer takes the output of the last Transformer block, denoted as $\mathbf{H}^{L}$, and applies a linear transformation using a parameter matrix $\mathbf{W}_{o}$. This matrix, with dimensions $d \times |V|$ (where $d$ is the model dimension and $|V|$ is the vocabulary size), maps the model's hidden states to the vocabulary space. The resulting logits are then passed through the Softmax function to produce a sequence of conditional probability distributions. The formula is:

$$\left[\Pr(\cdot \mid x_0),\ \Pr(\cdot \mid x_0, x_1),\ \ldots,\ \Pr(\cdot \mid x_0, \ldots, x_{m-1})\right] = \mathrm{Softmax}\!\left(\mathbf{H}^{L}\,\mathbf{W}_{o}\right)$$
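A minimal sketch of this computation, assuming PyTorch and illustrative sizes (the values of $d$, $|V|$, the sequence length, and the random tensors are stand-ins, not taken from the source):

```python
import torch

# Illustrative sizes (assumptions): model dimension d and vocabulary size |V|
d, vocab_size = 512, 32000
seq_len = 8  # number of tokens in the prefix

# H^L: output of the last Transformer block, one hidden state per position
H_L = torch.randn(seq_len, d)

# W_o: parameter matrix of the Softmax layer, shape d x |V|
W_o = torch.randn(d, vocab_size)

# Linear transformation maps hidden states to the vocabulary space -> logits
logits = H_L @ W_o                      # shape: (seq_len, vocab_size)

# Softmax over the vocabulary dimension gives one conditional
# distribution Pr(. | x_0, ..., x_i) per position i
probs = torch.softmax(logits, dim=-1)   # each row sums to 1

# The last row is the distribution for the token following the full prefix
next_token_dist = probs[-1]
print(next_token_dist.shape)            # torch.Size([32000])
```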

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A transformer block showing all the layers
BERT's Core Architecture
Decoder-Only Transformer as a Language Model
Generalized Formula for Post-Norm Architecture
Pre-Norm Architecture in Transformers
Core Function F(·) in Transformer Sub-layers
Output Probability Calculation in Transformer Language Models
Training Decoder-Only Language Models with Cross-Entropy Loss
Input Representation in Decoder-Only Transformers
Processing Flow of a Decoder-Only Transformer Language Model
Global Nature of Standard Transformer LLMs
Probability Distribution Formula for an Encoder-Softmax Language Model