Formula

Solution to KL Divergence Minimization for Policy Optimization

The optimization problem of minimizing the Kullback-Leibler (KL) divergence between a learned policy $\pi_{\theta}$ and an optimal policy $\pi^{*}$ is uniquely solved when the two probability distributions are identical. The optimal target policy is therefore obtained by setting $\pi_{\theta}$ equal to $\pi^{*}$, which is the reward-weighted reference policy divided by a normalization factor. This relationship is formally given by:

$$\pi_{\theta}(\mathbf{y}\mid\mathbf{x}) = \pi^{*}(\mathbf{y}\mid\mathbf{x}) = \frac{\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}\mid\mathbf{x}) \, \exp\big(\tfrac{1}{\beta}\, r(\mathbf{x},\mathbf{y})\big)}{Z(\mathbf{x})}$$
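To make the normalization factor $Z(\mathbf{x})$ concrete, here is a minimal sketch, assuming a discrete set of candidate responses for a single prompt $\mathbf{x}$; the function name `optimal_policy` and the example reference probabilities and rewards are illustrative assumptions, not values from the text.

```python
import numpy as np

def optimal_policy(ref_probs: np.ndarray, rewards: np.ndarray, beta: float) -> np.ndarray:
    """pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x) over a discrete candidate set."""
    # Unnormalized weights: reference probability scaled by the exponentiated reward.
    weights = ref_probs * np.exp(rewards / beta)
    # Z(x) is the partition function that makes the weights sum to one.
    Z = weights.sum()
    return weights / Z

# Hypothetical example: three candidate responses with rewards 1.0, 0.2, -0.5.
ref_probs = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 0.2, -0.5])
print(optimal_policy(ref_probs, rewards, beta=0.1))
```

Because the reward is exponentiated with temperature $\beta$, a smaller $\beta$ concentrates $\pi^{*}$ more sharply on high-reward responses, while a larger $\beta$ keeps it closer to the reference policy $\pi_{\theta_{\mathrm{ref}}}$.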



Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
