Example

Schematic of Probability Calculation in Causal Language Modeling

This schematic illustrates the sequential probability calculation in Causal Language Modeling, a type of auto-regressive model. For a sequence x_0, x_1, ..., x_4, the model predicts each token based on the embeddings of the tokens that came before it. The process begins by setting the probability of the first token, Pr(x_0), to 1. Each subsequent token's probability is then conditioned on the embeddings of all prior tokens, as shown in the diagram below. This unidirectional, step-by-step dependency is a core feature of causal language models.

Token:        x_0         x_1          x_2               x_3                    x_4
               ↓           ↓            ↓                 ↓                      ↓
Probability:  Pr(x_0)=1   Pr(x_1|e_0)  Pr(x_2|e_0,e_1)   Pr(x_3|e_0,e_1,e_2)    Pr(x_4|e_0,e_1,e_2,e_3)
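The joint probability of the sequence is the product of these conditional probabilities (the chain rule). A minimal numeric sketch is below; the conditional probability values are hypothetical placeholders for illustration, not outputs of any trained model.

```python
import math

# Hypothetical per-step conditional probabilities, matching the schematic:
# Pr(x_0)=1, Pr(x_1|e_0), Pr(x_2|e_0,e_1), Pr(x_3|e_0,e_1,e_2), Pr(x_4|e_0,e_1,e_2,e_3)
cond_probs = [1.0, 0.5, 0.4, 0.8, 0.25]

def sequence_log_prob(probs):
    """Sum of log conditionals = log of the joint sequence probability.

    Working in log space avoids numerical underflow for long sequences.
    """
    return sum(math.log(p) for p in probs)

joint = math.exp(sequence_log_prob(cond_probs))
print(joint)  # 1.0 * 0.5 * 0.4 * 0.8 * 0.25 = 0.04
```

In practice each conditional would come from a softmax over the model's output logits at that position; summing log-probabilities rather than multiplying raw probabilities is the standard way to keep the computation numerically stable.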


Updated 2026-05-02


Ch.1 Pre-training - Foundations of Large Language Models
