Short Answer

Policy Optimization Derivation Step

A key insight in policy optimization is that minimizing a specific KL divergence is equivalent to maximizing a reward-based objective. Consider the expression for the KL divergence between a learned policy, π_θ(·|x), and an optimal policy, π*(·|x):

KL(π_θ(·|x) || π*(·|x))

Given the definition of the optimal policy as π*(y|x) = (1/Z(x)) * π_ref(y|x) * exp(r(x,y)/β), expand the KL divergence expression. Your final answer should be in terms of the learned policy π_θ, the reference policy π_ref, the reward r(x,y), the scaling factor β, and the normalization term Z(x). Do not drop any terms.
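A candidate expansion can be sanity-checked numerically. Below is a minimal sketch, assuming a finite action space for a fixed prompt x and randomly generated (hypothetical) policies and rewards: it compares a direct computation of KL(π_θ || π*) against the expanded form obtained by substituting the definition of π*.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5        # finite action (token) space for a fixed prompt x
beta = 0.5   # scaling factor β

# Hypothetical policies and rewards, for illustration only.
pi_theta = rng.dirichlet(np.ones(n))   # learned policy π_θ(·|x)
pi_ref = rng.dirichlet(np.ones(n))     # reference policy π_ref(·|x)
r = rng.normal(size=n)                 # reward r(x, y) for each action y

# Optimal policy: π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)
unnorm = pi_ref * np.exp(r / beta)
Z = unnorm.sum()                       # normalization term Z(x)
pi_star = unnorm / Z

# Direct KL divergence: Σ_y π_θ(y|x) · log(π_θ(y|x) / π*(y|x))
kl_direct = np.sum(pi_theta * np.log(pi_theta / pi_star))

# Expanded form: E_{y~π_θ}[log(π_θ/π_ref) − r(x,y)/β] + log Z(x)
kl_expanded = np.sum(pi_theta * (np.log(pi_theta / pi_ref) - r / beta)) + np.log(Z)

assert np.isclose(kl_direct, kl_expanded)
```

Note that the log Z(x) term is a constant with respect to y and factors out of the expectation, which is why it must not be dropped when expanding.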


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science