Derivation of the KL Divergence Objective for Policy Optimization
The objective for policy optimization can be framed as minimizing the Kullback-Leibler (KL) divergence between the learned policy, π_θ, and the optimal reward-weighted policy, π*, defined as π*(y|x) = (1/Z(x)) π_ref(y|x) exp(r(x, y)/β), where r(x, y) is the reward function, β is a positive scalar, and Z(x) is the partition function that normalizes the distribution. The objective function is expressed as:

arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) ]

Minimizing the KL divergence, KL(π_θ(·|x) || π*(·|x)), drives the learned policy to match the optimal policy: the divergence attains its minimum of zero exactly when π_θ = π*. By substituting the definition of π* and simplifying, this objective can be transformed into a more practical form that directly involves the reward function, namely maximizing the expected reward E_{y~π_θ}[r(x, y)] minus the penalty β KL(π_θ(·|x) || π_ref(·|x)), which keeps the learned policy close to the reference policy π_ref.
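To make the derivation concrete, here is a minimal numerical sketch (numpy, toy discrete response space; pi_ref, r, and beta are made-up illustrative values, not from the source). It builds π* by reward-weighting π_ref and checks that the reverse KL objective equals the reward-with-KL-penalty form up to the constant log Z(x):

    import numpy as np

    # Toy setting: 5 possible responses y for a single fixed prompt x.
    rng = np.random.default_rng(0)
    beta = 0.5
    r = rng.normal(size=5)                # reward model scores r(x, y) (toy)
    pi_ref = rng.dirichlet(np.ones(5))    # reference (SFT) policy (toy)
    pi_theta = rng.dirichlet(np.ones(5))  # current learned policy (toy)

    # Optimal reward-weighted policy: pi*(y|x) = pi_ref(y|x) * exp(r/beta) / Z(x)
    w = pi_ref * np.exp(r / beta)
    Z = w.sum()
    pi_star = w / Z

    kl = lambda p, q: np.sum(p * np.log(p / q))

    # Expansion of the reverse KL objective:
    # KL(pi_theta || pi*) = KL(pi_theta || pi_ref) - E_{pi_theta}[r]/beta + log Z
    lhs = kl(pi_theta, pi_star)
    rhs = kl(pi_theta, pi_ref) - np.dot(pi_theta, r) / beta + np.log(Z)
    assert np.isclose(lhs, rhs)

Since log Z(x) does not depend on the policy parameters, minimizing the left-hand side is the same as maximizing E[r(x, y)] - β KL(π_θ || π_ref).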

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
PPO Objective for LLM Training
Derivation of the KL Divergence Objective for Policy Optimization
During the policy optimization stage of training a large language model, an engineer observes that the model's outputs are coherent and safe, but they show very little improvement over the initial supervised fine-tuned version and consistently receive mediocre scores from the reward model. Which of the following is the most likely cause of this issue, based on the policy optimization objective function that balances maximizing rewards with a penalty for policy divergence?
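A plausible reading of this scenario: the coefficient β on the divergence penalty is set too high, so the penalized objective is maximized by a policy that barely moves from the SFT reference and therefore keeps collecting mediocre rewards. A toy sketch with made-up numbers (the policies, rewards, and β values are illustrative, not from the source):

    import numpy as np

    kl = lambda p, q: np.sum(p * np.log(p / q))

    pi_ref = np.array([0.5, 0.3, 0.2])   # SFT reference policy (toy)
    r = np.array([0.1, 0.2, 2.0])        # reward model strongly favors outcome 2

    # Candidate policies interpolate from pi_ref toward the reward-greedy policy.
    greedy = np.array([0.01, 0.01, 0.98])
    for beta in (0.1, 10.0):             # reasonable vs. oversized KL penalty
        objs = []
        for a in np.linspace(0.0, 1.0, 11):
            pi = (1 - a) * pi_ref + a * greedy
            objs.append(np.dot(pi, r) - beta * kl(pi, pi_ref))
        print(f"beta={beta:4.1f}: best mixing weight = {np.argmax(objs) / 10:.1f}")
    # beta=10 picks 0.0, i.e. pi_ref itself: coherent, safe outputs, but no
    # improvement and mediocre reward -- the symptom described in the question.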
Analyzing the Trade-off in Policy Optimization
Analyzing a Modified Policy Optimization Objective
Formula for Soft Prompt Optimization by Minimizing KL Divergence
Derivation of the KL Divergence Objective for Policy Optimization
A machine learning model produces a probability distribution Q over a set of outcomes, aiming to approximate a true data distribution P. During evaluation, you observe that the divergence measure is low, while the reverse measure is high. Based on these results, what is the most likely characteristic of the model's distribution Q?
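A small numeric illustration of the asymmetry this question probes, with illustrative distributions and reading "the divergence measure" as the forward KL(P || Q) and "the reverse measure" as KL(Q || P) (an assumption; the question does not spell out the direction). A Q that spreads mass beyond where P concentrates keeps the forward measure low but drives the reverse measure up, suggesting an over-dispersed model:

    import numpy as np

    kl = lambda p, q: np.sum(p * np.log(p / q))

    # P concentrates on two outcomes; Q hedges across all three (over-dispersed).
    P = np.array([0.499, 0.499, 0.002])
    Q = np.array([0.34, 0.33, 0.33])

    print(f"forward KL(P || Q) = {kl(P, Q):.2f}")  # ~0.39: Q covers P's mass, so low
    print(f"reverse KL(Q || P) = {kl(Q, P):.2f}")  # ~1.42: Q wastes mass where P ~ 0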
Calculating Divergence Between Distributions
Choosing a Loss Function for Model Distillation
Derivation of the KL Divergence Objective for Policy Optimization
A language model's behavior is guided by a target probability distribution, π*, which is defined by re-weighting a reference distribution, π_ref, based on a reward score, r(x, y). The relationship is given by the formula: π*(y|x) = (1/Z(x)) π_ref(y|x) exp(r(x, y)/β), where Z(x) is a normalizing constant. In this formula, β is a positive scalar parameter. Analyze the effect of significantly increasing the value of β. What is the most direct consequence for the target distribution π*?
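To see the effect numerically, assuming the exp(r(x, y)/β) form used in the derivation above, the sketch below sweeps β with made-up values for π_ref and r: as β grows, the exponential weights flatten toward 1 and π* collapses back onto π_ref.

    import numpy as np

    pi_ref = np.array([0.5, 0.3, 0.2])   # reference distribution (toy)
    r = np.array([0.0, 1.0, 3.0])        # reward scores r(x, y) (toy)

    for beta in (0.2, 1.0, 50.0):
        w = pi_ref * np.exp(r / beta)
        pi_star = w / w.sum()
        print(f"beta={beta:5.1f}: pi* = {np.round(pi_star, 3)}")
    # beta=0.2 concentrates pi* on the highest-reward outcome;
    # beta=50 leaves pi* nearly equal to pi_ref: the reward's influence fades.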
Critique of a Modified Policy Formulation
Calculating a Target Policy Distribution
Learn After
Simplified Policy Optimization Objective as KL Divergence Minimization
In a derivation showing that a policy optimization objective is equivalent to minimizing KL divergence, the objective is simplified from arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) - log Z(x) ] to arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) ]. Why is it valid to remove the log Z(x) term during this final simplification?

A policy optimization objective can be shown to be equivalent to minimizing a KL divergence. Arrange the following expressions to show the correct logical sequence of this mathematical derivation, starting from the point where the optimal policy has been substituted into the objective.
Policy Optimization Derivation Step
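The first question above asks why log Z(x) can be dropped, and a tiny numeric check with made-up values makes the reason visible: Z(x) is computed from π_ref and r alone, so it does not depend on θ, and subtracting log Z(x) shifts every candidate policy's objective by the same constant without changing the argmin.

    import numpy as np

    rng = np.random.default_rng(1)
    kl = lambda p, q: np.sum(p * np.log(p / q))

    beta = 1.0
    pi_ref = rng.dirichlet(np.ones(4))   # reference policy (toy)
    r = rng.normal(size=4)               # reward scores (toy)
    w = pi_ref * np.exp(r / beta)
    Z = w.sum()                          # partition function: depends on x only
    pi_star = w / Z

    # Random stand-ins for the candidate policies pi_theta reachable in training.
    candidates = [rng.dirichlet(np.ones(4)) for _ in range(100)]
    with_logZ = [kl(p, pi_star) - np.log(Z) for p in candidates]
    without = [kl(p, pi_star) for p in candidates]
    assert np.argmin(with_logZ) == np.argmin(without)  # same minimizer either way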