Solution to KL Divergence Minimization for Policy Optimization
The optimization problem of minimizing the Kullback-Leibler (KL) divergence between a learned policy π_θ and an optimal policy π* is uniquely solved when the two probability distributions are identical. Thus, the optimal target policy is defined by setting π_θ equal to π*, which incorporates the reward-weighted reference policy and the normalization factor. This relationship is formally given by:

π_θ(y|x) = π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x, y)/β)
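As a minimal illustration (a toy numerical sketch, not from the book; the four-way output space, reward values, and β below are made-up assumptions), the following Python snippet constructs π* from a reference distribution and a reward function and checks that the KL divergence to π* is zero exactly when the learned policy equals π*, and strictly positive otherwise:

```python
import numpy as np

def target_policy(pi_ref, rewards, beta):
    """Reward-weighted, renormalized reference policy:
    pi*(y|x) = pi_ref(y|x) * exp(r(x, y)/beta) / Z(x)."""
    weights = pi_ref * np.exp(rewards / beta)
    return weights / weights.sum()  # dividing by Z(x) = sum of the weights

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

# Toy output space with four candidate outputs (illustrative values only).
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
rewards = np.array([1.0, 0.0, -1.0, 0.5])
beta = 0.5

pi_star = target_policy(pi_ref, rewards, beta)

# KL is zero iff the learned policy equals the target policy.
print(kl(pi_star, pi_star))          # 0.0 -> the unique minimizer
print(kl(pi_ref, pi_star) > 0.0)     # True for any other distribution
```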

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Solution to KL Divergence Minimization for Policy Optimization
When optimizing a policy π_θ to match an optimal policy π*, the objective function is often simplified from Objective A to Objective B:
Objective A: arg min_θ Eₓ[KL(π_θ(·|x) || π*(·|x)) - log Z(x)]
Objective B: arg min_θ Eₓ[KL(π_θ(·|x) || π*(·|x))]
What is the fundamental mathematical reason this simplification is valid? (See the numerical sketch after this list.)
Efficiency in Policy Optimization Implementation
Justification for Simplification in Policy Optimization
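The related question above hinges on the fact that Z(x) is determined entirely by x, π_ref, r, and β, and does not depend on θ. A small sketch under that assumption (toy values chosen for illustration; the softmax-parameterized policy is hypothetical) shows that Objective A and Objective B differ by the same constant -log Z(x) at every parameter setting, so they share the same minimizer:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Toy setup (illustrative values only).
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
rewards = np.array([1.0, 0.0, -1.0, 0.5])
beta = 0.5

weights = pi_ref * np.exp(rewards / beta)
Z = weights.sum()          # normalization term: depends on x, not on theta
pi_star = weights / Z

# Evaluate both objectives at two different parameter settings.
for theta in [np.zeros(4), np.array([0.3, -0.2, 0.1, 0.4])]:
    pi_theta = softmax(theta)
    obj_b = kl(pi_theta, pi_star)            # Objective B
    obj_a = obj_b - np.log(Z)                # Objective A
    print(round(obj_a, 4), round(obj_b, 4), round(obj_a - obj_b, 4))
# The gap obj_a - obj_b equals -log Z(x) for every theta, so dropping
# the log Z(x) term shifts the objective by a constant and leaves the
# argmin over theta unchanged.
```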
Learn After
Reward Function in Terms of Policy Models and Normalization Factor
In a particular policy optimization framework, the target policy, denoted as π*(y|x), is determined by the following relationship involving a reference policy π_ref(y|x), a reward function r(x, y), a positive temperature parameter β, and a normalization term Z(x): π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x, y)/β). Given this formula, what is the primary effect of significantly increasing the reward for a single, specific output, while keeping all other factors constant? (See the numerical sketch after this list.)
Consider the following equation that defines a target policy π*(y|x) based on a reference policy π_ref(y|x), a reward function r(x, y), a positive scaling parameter β, and a normalization term Z(x): π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x, y)/β). True or False: If the reward function r(x, y) is equal to zero for all possible outputs y given an input x, the target policy π*(y|x) will be identical to the reference policy π_ref(y|x).
Impact of the Scaling Parameter on Policy Behavior
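To make the behavior probed by the two questions above concrete, here is a small numerical sketch (toy values chosen for illustration; it repeats the target_policy helper from the first snippet so it runs on its own). It shows that an all-zero reward reproduces the reference policy exactly, that sharply increasing the reward of one specific output pulls probability mass toward that output, and that a smaller scaling parameter β amplifies the shift:

```python
import numpy as np

def target_policy(pi_ref, rewards, beta):
    """pi*(y|x) = pi_ref(y|x) * exp(r(x, y)/beta) / Z(x)."""
    weights = pi_ref * np.exp(rewards / beta)
    return weights / weights.sum()

pi_ref = np.array([0.4, 0.3, 0.2, 0.1])

# Zero reward everywhere: Z(x) = 1, so pi* collapses back to pi_ref.
print(target_policy(pi_ref, np.zeros(4), beta=0.5))    # [0.4 0.3 0.2 0.1]

# Raising the reward of one specific output (index 2) concentrates
# probability on it at the expense of all other outputs.
boosted = np.array([0.0, 0.0, 3.0, 0.0])
print(target_policy(pi_ref, boosted, beta=1.0))

# A smaller beta divides the same rewards by a smaller number before
# exponentiating, so the shift toward the high-reward output sharpens.
print(target_policy(pi_ref, boosted, beta=0.25))
```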