Formula

Objective Function for Policy Optimization

An objective function, denoted $U(\mathbf{x}, \mathbf{y}; \theta)$, can be formulated to guide the training of a sequence generation model, often in the context of policy optimization. It is computed by summing the weighted log-probabilities of the policy over each step of the generated sequence:

$$U(\mathbf{x}, \mathbf{y}; \theta) = \sum_{t=1}^{T} A(\mathbf{x}, y_t, \mathbf{y}_{<t}) \log \pi_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$$

where $\pi_\theta$ is the model's policy (a probability distribution over output tokens) and $A(\cdot)$ is a function that assigns a weight, or advantage, to each step.
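The sum above can be sketched numerically. This is a minimal NumPy illustration, assuming the per-step log-probabilities $\log \pi_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$ and advantages $A(\mathbf{x}, y_t, \mathbf{y}_{<t})$ have already been computed as arrays; the helper name `objective` is hypothetical, not from the source.

```python
import numpy as np

def objective(log_probs: np.ndarray, advantages: np.ndarray) -> float:
    """U(x, y; theta) = sum_t A(x, y_t, y_<t) * log pi_theta(y_t | x, y_<t).

    log_probs:  per-step log-probabilities, shape (T,)
    advantages: per-step weights A(.), shape (T,)
    """
    return float(np.sum(advantages * log_probs))

# Toy example: a 3-token sequence with token probabilities 0.5, 0.25, 0.8
log_probs = np.log(np.array([0.5, 0.25, 0.8]))
advantages = np.ones(3)  # uniform advantages of 1.0
u = objective(log_probs, advantages)
```

Note that with all advantages equal to 1, the objective reduces to the log-likelihood of the whole sequence, $\log \prod_t \pi_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$; the advantage function reweights each step's contribution.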


Updated 2025-10-09


Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences