Learn Before
  • Transformer Blocks and Post-Norm Architecture

  • Decoder-Only Transformer as a Language Model

  • Parameterized Softmax Layer

  • Model Depth (L) in Transformers

  • Logits in Transformer Language Models

Output Probability Calculation in Transformer Language Models

In a Transformer-based language model, the probability distribution for the next token is computed by a final Softmax layer. This layer takes the output of the last Transformer block, denoted $\mathbf{H}^L$, and applies a linear transformation with a parameter matrix $\mathbf{W}^o$. This matrix, of dimensions $d \times |V|$ (where $d$ is the model dimension and $|V|$ is the vocabulary size), maps the model's hidden states to the vocabulary space. The resulting logits are then passed through the Softmax function to produce a sequence of conditional probability distributions:

$$\begin{bmatrix} \Pr(\cdot \mid x_0, \dots, x_{m-1}) \\ \vdots \\ \Pr(\cdot \mid x_0, x_1) \\ \Pr(\cdot \mid x_0) \end{bmatrix} = \mathrm{Softmax}(\mathbf{H}^L \mathbf{W}^o)$$
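As a concrete illustration, the minimal NumPy sketch below computes the product $\mathbf{H}^L \mathbf{W}^o$ and applies a row-wise Softmax. The sizes `m`, `d`, and `V`, and the random matrices standing in for $\mathbf{H}^L$ and $\mathbf{W}^o$, are illustrative assumptions, not values from the source; each row of the result corresponds to one conditional distribution in the stacked vector of the formula above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax: turns each row of logits into a probability distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes (assumptions, not from the source):
m, d, V = 4, 8, 16                 # sequence length, model dimension d, vocabulary size |V|
rng = np.random.default_rng(0)

H_L = rng.standard_normal((m, d))  # stand-in for the last Transformer block's output H^L
W_o = rng.standard_normal((d, V))  # stand-in for the output projection W^o, shape d x |V|

logits = H_L @ W_o                 # shape (m, |V|): one row of logits per prefix position
probs = softmax(logits)            # row i is the distribution Pr(. | x_0, ..., x_i)

assert np.allclose(probs.sum(axis=-1), 1.0)  # every row sums to one
```

Row `i` of `probs` gives the model's distribution over the next token after reading the prefix `x_0, ..., x_i`, so a single matrix multiply and Softmax yields all the conditional distributions at once.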

Tags
  • Ch.2 Generative Models - Foundations of Large Language Models

  • Foundations of Large Language Models

  • Foundations of Large Language Models Course

  • Computing Sciences

Related
  • A transformer block showing all the layers

  • BERT's Core Architecture

  • Decoder-Only Transformer as a Language Model

  • Generalized Formula for Post-Norm Architecture

  • Pre-Norm Architecture in Transformers

  • Core Function F(·) in Transformer Sub-layers

  • Output Probability Calculation in Transformer Language Models

  • Training Decoder-Only Language Models with Cross-Entropy Loss

  • Input Representation in Decoder-Only Transformers

  • Processing Flow of a Decoder-Only Transformer Language Model

  • Global Nature of Standard Transformer LLMs

  • Probability Distribution Formula for an Encoder-Softmax Language Model
