Derivation of the KL Divergence Objective for Policy Optimization
The objective for policy optimization can be framed as minimizing the Kullback-Leibler (KL) divergence between the learned policy, π_θ, and the optimal reward-weighted policy, π*, defined as π*(y|x) = (1/Z(x)) π_ref(y|x) exp(r(x, y)/β), where r(x, y) is the reward function, β is a positive scalar, and Z(x) is the partition function that normalizes the distribution. The objective function is expressed as:

arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) ]

Minimizing the KL divergence, KL(π_θ(·|x) || π*(·|x)), drives the learned policy to match the optimal policy: the divergence attains its minimum of zero exactly when π_θ = π*. By substituting the definition of π* and simplifying, this objective can be transformed into a more practical form that directly involves the reward function, namely maximizing the expected reward E_{y~π_θ}[r(x, y)] minus the penalty β KL(π_θ(·|x) || π_ref(·|x)), which keeps the learned policy close to the reference policy π_ref.
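To make the derivation concrete, here is a minimal numerical sketch (numpy, toy discrete response space; pi_ref, r, and beta are made-up illustrative values, not from the source). It builds π* by reward-weighting π_ref and checks that the reverse KL objective equals the reward-with-KL-penalty form up to the constant log Z(x):

    import numpy as np

    # Toy setting: 5 possible responses y for a single fixed prompt x.
    rng = np.random.default_rng(0)
    beta = 0.5
    r = rng.normal(size=5)                # reward model scores r(x, y) (toy)
    pi_ref = rng.dirichlet(np.ones(5))    # reference (SFT) policy (toy)
    pi_theta = rng.dirichlet(np.ones(5))  # current learned policy (toy)

    # Optimal reward-weighted policy: pi*(y|x) = pi_ref(y|x) * exp(r/beta) / Z(x)
    w = pi_ref * np.exp(r / beta)
    Z = w.sum()
    pi_star = w / Z

    kl = lambda p, q: np.sum(p * np.log(p / q))

    # Expansion of the reverse KL objective:
    # KL(pi_theta || pi*) = KL(pi_theta || pi_ref) - E_{pi_theta}[r]/beta + log Z
    lhs = kl(pi_theta, pi_star)
    rhs = kl(pi_theta, pi_ref) - np.dot(pi_theta, r) / beta + np.log(Z)
    assert np.isclose(lhs, rhs)

Since log Z(x) does not depend on the policy parameters, minimizing the left-hand side is the same as maximizing E[r(x, y)] - β KL(π_θ || π_ref).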

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
PPO Objective for LLM Training
Derivation of the KL Divergence Objective for Policy Optimization
During the policy optimization stage of training a large language model, an engineer observes that the model's outputs are coherent and safe, but they show very little improvement over the initial supervised fine-tuned version and consistently receive mediocre scores from the reward model. Which of the following is the most likely cause of this issue, based on the policy optimization objective function that balances maximizing rewards with a penalty for policy divergence?
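A plausible reading of this scenario: the coefficient β on the divergence penalty is set too high, so the penalized objective is maximized by a policy that barely moves from the SFT reference and therefore keeps collecting mediocre rewards. A toy sketch with made-up numbers (the policies, rewards, and β values are illustrative, not from the source):

    import numpy as np

    kl = lambda p, q: np.sum(p * np.log(p / q))

    pi_ref = np.array([0.5, 0.3, 0.2])   # SFT reference policy (toy)
    r = np.array([0.1, 0.2, 2.0])        # reward model strongly favors outcome 2

    # Candidate policies interpolate from pi_ref toward the reward-greedy policy.
    greedy = np.array([0.01, 0.01, 0.98])
    for beta in (0.1, 10.0):             # reasonable vs. oversized KL penalty
        objs = []
        for a in np.linspace(0.0, 1.0, 11):
            pi = (1 - a) * pi_ref + a * greedy
            objs.append(np.dot(pi, r) - beta * kl(pi, pi_ref))
        print(f"beta={beta:4.1f}: best mixing weight = {np.argmax(objs) / 10:.1f}")
    # beta=10 picks 0.0, i.e. pi_ref itself: coherent, safe outputs, but no
    # improvement and mediocre reward -- the symptom described in the question.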
Analyzing the Trade-off in Policy Optimization
Analyzing a Modified Policy Optimization Objective
Formula for Soft Prompt Optimization by Minimizing KL Divergence
Derivation of the KL Divergence Objective for Policy Optimization
A machine learning model produces a probability distribution Q over a set of outcomes, aiming to approximate a true data distribution P. During evaluation, you observe that the divergence measure is low, while the reverse measure is high. Based on these results, what is the most likely characteristic of the model's distribution Q?
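A small numeric illustration of the asymmetry this question probes, with illustrative distributions and reading "the divergence measure" as the forward KL(P || Q) and "the reverse measure" as KL(Q || P) (an assumption; the question does not spell out the direction). A Q that spreads mass beyond where P concentrates keeps the forward measure low but drives the reverse measure up, suggesting an over-dispersed model:

    import numpy as np

    kl = lambda p, q: np.sum(p * np.log(p / q))

    # P concentrates on two outcomes; Q hedges across all three (over-dispersed).
    P = np.array([0.499, 0.499, 0.002])
    Q = np.array([0.34, 0.33, 0.33])

    print(f"forward KL(P || Q) = {kl(P, Q):.2f}")  # ~0.39: Q covers P's mass, so low
    print(f"reverse KL(Q || P) = {kl(Q, P):.2f}")  # ~1.42: Q wastes mass where P ~ 0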
Calculating Divergence Between Distributions
Choosing a Loss Function for Model Distillation
Derivation of the KL Divergence Objective for Policy Optimization
A language model's behavior is guided by a target probability distribution, π*, which is defined by re-weighting a reference distribution, π_ref, based on a reward score, r(x, y). The relationship is given by the formula: π*(y|x) = (1/Z(x)) π_ref(y|x) exp(r(x, y)/β), where Z(x) is a normalizing constant. In this formula, β is a positive scalar parameter. Analyze the effect of significantly increasing the value of β. What is the most direct consequence for the target distribution π*?
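To see the effect numerically, assuming the exp(r(x, y)/β) form used in the derivation above, the sketch below sweeps β with made-up values for π_ref and r: as β grows, the exponential weights flatten toward 1 and π* collapses back onto π_ref.

    import numpy as np

    pi_ref = np.array([0.5, 0.3, 0.2])   # reference distribution (toy)
    r = np.array([0.0, 1.0, 3.0])        # reward scores r(x, y) (toy)

    for beta in (0.2, 1.0, 50.0):
        w = pi_ref * np.exp(r / beta)
        pi_star = w / w.sum()
        print(f"beta={beta:5.1f}: pi* = {np.round(pi_star, 3)}")
    # beta=0.2 concentrates pi* on the highest-reward outcome;
    # beta=50 leaves pi* nearly equal to pi_ref: the reward's influence fades.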
Critique of a Modified Policy Formulation
Calculating a Target Policy Distribution
Learn After
Simplified Policy Optimization Objective as KL Divergence Minimization
In a derivation showing that a policy optimization objective is equivalent to minimizing KL divergence, the objective is simplified from arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) - log Z(x) ] to arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) ]. Why is it valid to remove the log Z(x) term during this final simplification?

A policy optimization objective can be shown to be equivalent to minimizing a KL divergence. Arrange the following expressions to show the correct logical sequence of this mathematical derivation, starting from the point where the optimal policy has been substituted into the objective.
Policy Optimization Derivation Step
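The first question above asks why log Z(x) can be dropped, and a tiny numeric check with made-up values makes the reason visible: Z(x) is computed from π_ref and r alone, so it does not depend on θ, and subtracting log Z(x) shifts every candidate policy's objective by the same constant without changing the argmin.

    import numpy as np

    rng = np.random.default_rng(1)
    kl = lambda p, q: np.sum(p * np.log(p / q))

    beta = 1.0
    pi_ref = rng.dirichlet(np.ones(4))   # reference policy (toy)
    r = rng.normal(size=4)               # reward scores (toy)
    w = pi_ref * np.exp(r / beta)
    Z = w.sum()                          # partition function: depends on x only
    pi_star = w / Z

    # Random stand-ins for the candidate policies pi_theta reachable in training.
    candidates = [rng.dirichlet(np.ones(4)) for _ in range(100)]
    with_logZ = [kl(p, pi_star) - np.log(Z) for p in candidates]
    without = [kl(p, pi_star) for p in candidates]
    assert np.argmin(with_logZ) == np.argmin(without)  # same minimizer either way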