Concept

Decoder-Only Transformer as a Language Model

The decoder-only Transformer architecture is a prevalent design for Large Language Models (LLMs). It is typically obtained by modifying a standard Transformer decoder, specifically by removing the cross-attention sub-layers. The core of the architecture is a stack of L Transformer blocks, each comprising a self-attention sub-layer and a feed-forward network (FFN) sub-layer. To prevent the model from attending to the right-context (future tokens), a causal mask is incorporated into the self-attention mechanism. Finally, the output layer applies a Softmax function to produce a probability distribution over the next token given the sequence of previous tokens, enabling auto-regressive generation.
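The sketch below illustrates this structure in PyTorch: a stack of decoder blocks (masked self-attention + FFN), a causal mask that hides future positions, and a Softmax output layer used for greedy auto-regressive generation. All names and hyper-parameters (vocab_size, d_model, n_heads, d_ff, n_layers, max_len) are illustrative assumptions, not values specified in the text above.

```python
# A minimal sketch of a decoder-only Transformer language model (assumed
# hyper-parameters; not a specific published model).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlock(nn.Module):
    """One Transformer block: masked self-attention followed by an FFN."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Causal self-attention: each position may only attend to itself
        # and earlier positions (the right-context is masked out).
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x


class DecoderOnlyLM(nn.Module):
    """L stacked decoder blocks plus a Softmax output over the vocabulary."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4,
                 d_ff: int = 1024, n_layers: int = 4, max_len: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Upper-triangular mask: -inf above the diagonal blocks attention
        # to future tokens.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        for block in self.blocks:
            x = block(x, causal_mask)
        # Probability distribution over the next token at every position.
        return F.softmax(self.out_proj(x), dim=-1)


# Usage: greedy auto-regressive generation of a few tokens from a toy prompt.
if __name__ == "__main__":
    model = DecoderOnlyLM(vocab_size=1000)
    tokens = torch.tensor([[1, 5, 42]])
    for _ in range(5):
        probs = model(tokens)                      # (1, seq_len, vocab_size)
        next_tok = probs[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    print(tokens)
```

In practice the output layer usually returns logits and defers the Softmax to the loss or sampling step; it is written out explicitly here only to mirror the description above.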

Tags

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.2 Generative Models - Foundations of Large Language Models

Ch.5 Inference - Foundations of Large Language Models