Left-to-Right Factorization in Sequence Models

The joint probability of a sequence, $P(x_1, \ldots, x_T)$, can be factorized in reverse (right-to-left) or in any other order, and every such factorization is mathematically valid. Left-to-right (in-order) factorization is nevertheless generally preferred for language modeling, for three reasons. First, it matches the way people read: we anticipate upcoming words from the ones we have already seen. Second, factorizing in order lets the same language model assign probabilities to arbitrarily long sequences, because extending a sequence by one token only requires multiplying the probability so far by the conditional probability of the next token: $P(x_{t+1}, \ldots, x_1) = P(x_t, \ldots, x_1) \cdot P(x_{t+1} \mid x_t, \ldots, x_1)$. Third, for causally structured data, where future events cannot influence the past, predicting forward with $P(x_{t+1} \mid x_t)$ is usually an easier modeling problem than predicting backward with $P(x_t \mid x_{t+1})$.
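The incremental update in the second point can be sketched in a few lines of Python. This is a minimal illustration, not a real language model: the vocabulary, the `INITIAL` and `TRANSITION` tables, and the function names are all hypothetical, and the toy model assumes the next token depends only on the previous one. Probabilities are accumulated as sums of logs instead of products, which is a common choice for avoiding underflow on long sequences.

```python
import math

# Hypothetical toy model: a first-order Markov chain over three tokens.
# All numbers are made up for illustration; each row sums to 1.
INITIAL = {"the": 0.6, "cat": 0.3, "sat": 0.1}
TRANSITION = {
    "the": {"the": 0.1, "cat": 0.7, "sat": 0.2},
    "cat": {"the": 0.2, "cat": 0.1, "sat": 0.7},
    "sat": {"the": 0.5, "cat": 0.3, "sat": 0.2},
}


def next_token_prob(history, token):
    """Return P(token | history); this toy model only looks at the last token."""
    if not history:
        return INITIAL[token]
    return TRANSITION[history[-1]][token]


def sequence_log_prob(tokens):
    """Left-to-right chain rule: sum log P(x_t | x_1, ..., x_{t-1}) over t."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        log_p += math.log(next_token_prob(tokens[:t], token))
    return log_p


def extend_log_prob(prefix_log_p, prefix, token):
    """Score one more token by reusing the prefix score (one extra term)."""
    return prefix_log_p + math.log(next_token_prob(prefix, token))


prefix = ["the", "cat"]
prefix_log_p = sequence_log_prob(prefix)
full_log_p = extend_log_prob(prefix_log_p, prefix, "sat")
print(math.exp(prefix_log_p), math.exp(full_log_p))
```

Scoring the longer sequence reuses the prefix score and adds a single conditional term, so the same model handles sequences of any length without being redefined.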

Updated 2026-05-13

Tags

D2L

Dive into Deep Learning @ D2L