Formula

Objective Function for Policy Optimization

An objective function, denoted $U(\mathbf{x}, \mathbf{y}; \theta)$, can be formulated to guide the training of a sequence generation model, often in the context of policy optimization. It is computed by summing the weighted log-probabilities of the policy over each step of the generated sequence:

$$U(\mathbf{x}, \mathbf{y}; \theta) = \sum_{t=1}^{T} A(\mathbf{x}, y_t, \mathbf{y}_{<t}) \log \pi_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$$

where $\pi_\theta$ is the model's policy (a probability distribution over output tokens) and $A(\cdot)$ is a function that assigns a weight, or advantage, to each step.
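The sum above can be sketched numerically. This is a minimal NumPy illustration, assuming the per-step log-probabilities $\log \pi_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$ and advantages $A(\mathbf{x}, y_t, \mathbf{y}_{<t})$ have already been computed as arrays; the helper name `objective` is hypothetical, not from the source.

```python
import numpy as np

def objective(log_probs: np.ndarray, advantages: np.ndarray) -> float:
    """U(x, y; theta) = sum_t A(x, y_t, y_<t) * log pi_theta(y_t | x, y_<t).

    log_probs:  per-step log-probabilities, shape (T,)
    advantages: per-step weights A(.), shape (T,)
    """
    return float(np.sum(advantages * log_probs))

# Toy example: a 3-token sequence with token probabilities 0.5, 0.25, 0.8
log_probs = np.log(np.array([0.5, 0.25, 0.8]))
advantages = np.ones(3)  # uniform advantages of 1.0
u = objective(log_probs, advantages)
```

Note that with all advantages equal to 1, the objective reduces to the log-likelihood of the whole sequence, $\log \prod_t \pi_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$; the advantage function reweights each step's contribution.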


Updated 2025-10-09


Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences