Simplified Policy Optimization Objective as KL Divergence Minimization
The policy optimization objective can be mathematically simplified to finding the parameters θ that minimize the expected Kullback-Leibler (KL) divergence between the learned target policy π_θ(·|x) and the optimal target distribution π*(·|x). This simplification is mathematically sound because the normalization term, log Z(x), is independent of the optimization variable θ and can therefore be removed from the arg min operation without altering the optimal parameters. The simplified training objective is expressed as:

arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) ]
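As a quick numerical illustration of why a θ-independent term cannot change the arg min, the sketch below compares the objective with and without the subtracted constant on a toy one-parameter softmax policy. All names and numbers here (pi_star, the fixed logits, and the constant standing in for log Z(x)) are illustrative assumptions, not values from the course material.

import numpy as np

pi_star = np.array([0.7, 0.2, 0.1])  # toy "optimal" target distribution

def pi_theta(theta):
    # Toy one-parameter policy: softmax of fixed logits scaled by theta.
    logits = theta * np.array([2.0, 1.0, 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl(p, q):
    # KL(p || q) for categorical distributions with strictly positive entries.
    return float(np.sum(p * np.log(p / q)))

log_Z = 3.14  # any theta-independent constant, standing in for log Z(x)

thetas = np.linspace(-2.0, 4.0, 601)
obj_a = [kl(pi_theta(t), pi_star) - log_Z for t in thetas]  # objective with the constant
obj_b = [kl(pi_theta(t), pi_star) for t in thetas]          # simplified objective

# Subtracting a constant shifts the curve vertically but cannot move its minimizer,
# so both objectives select the same theta.
print(thetas[np.argmin(obj_a)], thetas[np.argmin(obj_b)])

Both printed values coincide, which is exactly the justification given above for dropping log Z(x) from the arg min.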
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Simplified Policy Optimization Objective as KL Divergence Minimization
In a derivation showing that a policy optimization objective is equivalent to minimizing KL divergence, the objective is simplified from
arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) - log Z(x) ]
to
arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) ].
Why is it valid to remove the log Z(x) term during this final simplification?

A policy optimization objective can be shown to be equivalent to minimizing a KL divergence. Arrange the following expressions to show the correct logical sequence of this mathematical derivation, starting from the point where the optimal policy has been substituted into the objective.
Policy Optimization Derivation Step
Learn After
Solution to KL Divergence Minimization for Policy Optimization
When optimizing a policy π_θ to match an optimal policy π*, the objective function is often simplified from Objective A to Objective B:
Objective A:
arg min_θ Eₓ[KL(π_θ(·|x) || π*(·|x)) - log Z(x)]
Objective B:
arg min_θ Eₓ[KL(π_θ(·|x) || π*(·|x))]
What is the fundamental mathematical reason this simplification is valid?
Efficiency in Policy Optimization Implementation
Justification for Simplification in Policy Optimization