Causal Language Modeling

Causal language modeling, also known as standard language modeling, is an auto-regressive pre-training approach where tokens are predicted sequentially in their natural, fixed order in the text (typically left-to-right). For instance, a sequence of 5 tokens $x_0 x_1 x_2 x_3 x_4$ is generated in the order $x_0 \to x_1 \to x_2 \to x_3 \to x_4$. The overall sequence probability $\Pr(\mathbf{x})$ is the product of individual token probabilities conditioned on preceding tokens: $\Pr(x_0) \cdot \Pr(x_1|x_0) \cdot \Pr(x_2|x_0,x_1) \cdot \Pr(x_3|x_0,x_1,x_2) \cdot \Pr(x_4|x_0,x_1,x_2,x_3)$. Substituting $\mathbf{e}_i$, the embedding of token $x_i$ (a combination of its token and positional embeddings), the generation process is modeled as $\Pr(x_0) \cdot \Pr(x_1|\mathbf{e}_0) \cdot \Pr(x_2|\mathbf{e}_0,\mathbf{e}_1) \cdot \Pr(x_3|\mathbf{e}_0,\mathbf{e}_1,\mathbf{e}_2) \cdot \Pr(x_4|\mathbf{e}_0,\mathbf{e}_1,\mathbf{e}_2,\mathbf{e}_3)$. Each prediction thus depends solely on past context.
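
To make the chain-rule factorization concrete, below is a minimal sketch of a one-layer causal Transformer in PyTorch (the text names no framework, so that choice is an assumption; `TinyCausalLM` and all sizes are hypothetical). It shows the two ingredients the paragraph describes: embeddings $\mathbf{e}_i$ built from token plus position, and a triangular attention mask so the prediction of $x_t$ sees only $\mathbf{e}_0, \dots, \mathbf{e}_{t-1}$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, MAX_LEN = 100, 32, 16  # toy sizes, chosen arbitrarily


class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)     # token embeddings
        self.pos = nn.Embedding(MAX_LEN, DIM)   # positional embeddings
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(DIM, VOCAB)       # logits over the vocabulary

    def forward(self, x):                        # x: (batch, seq_len)
        t = x.size(1)
        e = self.tok(x) + self.pos(torch.arange(t))  # e_i = token + position
        # Causal mask: True entries are blocked, so position i attends only
        # to positions <= i, i.e. each prediction sees only past context.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h = self.encoder(e, mask=mask)
        return self.head(h)                      # (batch, seq_len, VOCAB)


x = torch.randint(0, VOCAB, (1, 5))              # a sequence x_0 .. x_4
model = TinyCausalLM()
logits = model(x)

# Chain rule: log Pr(x) = sum_t log Pr(x_t | x_0 .. x_{t-1}).
# Logits at position t-1 score token x_t, so shift by one.
# (Pr(x_0) is usually handled by prepending a BOS token; omitted here.)
log_probs = F.log_softmax(logits[:, :-1], dim=-1)
log_pr_x = log_probs.gather(-1, x[:, 1:].unsqueeze(-1)).sum()
print(log_pr_x)  # log of Pr(x_1|e_0) * ... * Pr(x_4|e_0,e_1,e_2,e_3)
```

Negating this summed log-probability gives the standard next-token cross-entropy loss used during pre-training.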
