
Autoregressive Decomposition of the LLM Inference Objective

In large language model inference, the optimal output sequence $\hat{\mathbf{y}}$ is the one that maximizes the conditional log-probability given the input $\mathbf{x}$. This objective, expressed as finding the argument that maximizes $\log \Pr(\mathbf{y}|\mathbf{x})$, can be decomposed using the chain rule of probability: the total log-probability of the output sequence equals the sum of the conditional log-probabilities of the individual tokens $y_i$. This is expressed as:

$$\hat{\mathbf{y}} = \underset{\mathbf{y}}{\arg\max}\, \log \Pr(\mathbf{y}|\mathbf{x}) = \underset{\mathbf{y}}{\arg\max} \sum_{i=1}^{n} \log \Pr(y_i|\mathbf{x}, \mathbf{y}_{<i})$$

In this formula, $\mathbf{x}$ denotes the entire input sequence and $\mathbf{y}_{<i}$ denotes all previously generated output tokens. Written out in full, the conditional probability term is $\Pr(y_i|x_0, \ldots, x_m, y_1, \ldots, y_{i-1})$, where the input sequence is $(x_0, \ldots, x_m)$ and the preceding output is $(y_1, \ldots, y_{i-1})$. This formulation is the mathematical basis for autoregressive generation.
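The decomposition can be sketched concretely: a toy next-token distribution stands in for the model's $\Pr(y_i|\mathbf{x}, \mathbf{y}_{<i})$, and the sequence score is accumulated token by token exactly as in the sum above. A minimal sketch (the vocabulary, `toy_conditional`, and `sequence_log_prob` are illustrative assumptions, not from the text):

```python
import math

# A tiny stand-in for an LLM's next-token distribution Pr(y_i | x, y_<i).
# It is conditioned on the full context: the input x plus all previously
# generated tokens. A real model exposes the same interface via a softmax.
VOCAB = ["<eos>", "hello", "world"]

def toy_conditional(context):
    """Hypothetical next-token distribution over VOCAB given the context."""
    if context and context[-1] == "hello":
        probs = [0.1, 0.1, 0.8]   # strongly favor "world" after "hello"
    else:
        probs = [0.6, 0.3, 0.1]   # otherwise favor ending the sequence
    return dict(zip(VOCAB, probs))

def sequence_log_prob(x, y):
    """log Pr(y|x) = sum_i log Pr(y_i | x, y_<i), by the chain rule."""
    total = 0.0
    context = list(x)
    for token in y:
        total += math.log(toy_conditional(context)[token])
        context.append(token)  # each generated token joins the context
    return total
```

Because the log turns the product of conditionals into a sum, the score of a candidate sequence is just the running total of per-token log-probabilities; decoding strategies such as greedy or beam search then approximate the $\arg\max$ over this score.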

Updated 2026-05-02

Ch.2 Generative Models - Foundations of Large Language Models