Output Probability Calculation in Transformer Language Models
In a Transformer language model comprising stacked blocks, the probability distribution for the next token is generated by applying a Softmax layer to the output of the final block. This involves multiplying the final block's output, H (a [sequence_length × hidden_dimension] matrix of hidden states), by a parameter weight matrix, W (of shape [hidden_dimension × |V|], where |V| is the vocabulary size). This operation, Softmax(H ⋅ W), produces a sequence of probability distributions over the vocabulary, where the distribution at position i represents the conditional probability of the next token given the preceding sequence: Pr(· | x_1, …, x_i) = Softmax(h_i ⋅ W), with h_i denoting the i-th row of H.
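A minimal NumPy sketch of this computation, assuming toy dimensions and a hand-rolled softmax helper purely for illustration (the specific sizes and variable names are not taken from the course material):

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtract the row-wise max for numerical stability; this does not change
    # the result, since Softmax is invariant to adding a constant to all scores.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical sizes, chosen only for this sketch.
seq_len, hidden_dim, vocab_size = 5, 768, 30_000

# H: output of the final Transformer block, one hidden-state vector per position.
H = np.random.randn(seq_len, hidden_dim)

# W: output projection matrix, mapping each hidden state to one raw score
# (logit) per vocabulary item.
W = np.random.randn(hidden_dim, vocab_size)

logits = H @ W                    # shape: [seq_len, vocab_size]
probs = softmax(logits, axis=-1)  # each row is a distribution over the vocabulary

# Row i is the model's distribution over the next token given tokens x_1..x_i.
assert probs.shape == (seq_len, vocab_size)
assert np.allclose(probs.sum(axis=-1), 1.0)
```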

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Training Decoder-Only Language Models with Cross-Entropy Loss
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Greedy Decoding in Language Models
Structure of a Transformer Block
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Function of Self-Attention in Auto-regressive Generation
Neural Network-Based Next-Token Probability Distribution
Probability Distribution Formula for an Encoder-Softmax Language Model
Next-Token Probability Calculation in Autoregressive Decoders
A neural network produces a final matrix of hidden state vectors, H, with dimensions [sequence_length × hidden_dimension]. To generate a probability distribution over a vocabulary of size V for each position in the sequence, a parameterized Softmax layer is used, which computes Softmax(H ⋅ W). What is the primary role and required shape of the weight matrix W in this operation?
Debugging a Parameterized Softmax Layer
A parameterized Softmax layer is used to convert a sequence of hidden state vectors into a sequence of probability distributions over a vocabulary. Arrange the following steps of this process into the correct chronological order.
A language model is tasked with predicting the next word for the sequence 'The cat sat on the'. After processing this input, the model's final linear layer produces a vector with 50,257 raw numerical scores, one for each word in its vocabulary. Which statement best characterizes this vector of raw scores, just before any final normalization function (like Softmax) is applied?
A language model has produced a vector of raw, unnormalized scores for all possible next words in its vocabulary. If a data scientist adds a constant value of 10 to every single score in this vector, the final probability assigned to each word will change.
Interpreting Model Output Scores
A language model based on a standard multi-layer architecture is given an input sequence of 15 words. The model's vocabulary consists of 30,000 unique words. After processing the input through all its layers, what is the nature of the final output generated by the model's terminal probability-calculating layer for this sequence?
Analyzing Transformer Model Output
Analyzing a Language Model's Output Layer
BERT's Core Architecture
Trade-offs of Model Depth
An AI team is developing solutions for two distinct tasks: Task A, which involves classifying short customer reviews as positive or negative, and Task B, which requires generating concise summaries of long, complex legal documents. They have two available models: Model X with 6 stacked processing layers and Model Y with 24 stacked processing layers. Based on the relationship between model depth and capability, which of the following strategies is most appropriate?
Analyzing the Impact of Increasing Model Layers
Learn After
Diagnosing a Language Model's Output Layer
A decoder-only language model has an internal hidden dimension of 768 and a vocabulary of 30,000 unique tokens. After processing an input sequence, the model's final layer of hidden states is multiplied by a weight matrix to produce logits, which are then passed to a final activation function. What must be the dimensions of this weight matrix and what is its primary role in this process?
From Hidden State to Probability Distribution