Formula

Derivation of the KL Divergence Objective for Policy Optimization

The objective for policy optimization can be framed as minimizing the Kullback-Leibler (KL) divergence between the learned policy, $\pi_{\theta}(\mathbf{y}|\mathbf{x})$, and the optimal reward-weighted policy, $\pi^{*}(\mathbf{y}|\mathbf{x})$. The objective function is expressed as:

$$\max_{\theta} \; \mathbb{E}_{\mathbf{x} \sim D} \left[ \mathbb{E}_{\mathbf{y} \sim \pi^{*}(\cdot|\mathbf{x})} \left[ \log \pi_{\theta}(\mathbf{y}|\mathbf{x}) \right] \right]$$

Minimizing the KL divergence $\text{KL}(\pi^{*} \,\|\, \pi_{\theta})$ is equivalent to maximizing the log-likelihood of the optimal policy's samples under the learned policy, because the term $\mathbb{E}_{\pi^{*}}[\log \pi^{*}]$ in the expanded divergence does not depend on $\theta$. By substituting the definition of $\pi^{*}$ and simplifying, this objective can be transformed into a more practical form that directly involves the reward function.
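The equivalence in the second sentence follows from expanding the KL divergence. A minimal derivation sketch, assuming the commonly used reward-weighted form $\pi^{*}(\mathbf{y}|\mathbf{x}) \propto \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp\!\left(r(\mathbf{x},\mathbf{y})/\beta\right)$ (the reference policy $\pi_{\text{ref}}$, reward $r$, and temperature $\beta$ are assumptions not defined in this excerpt):

$$
\begin{aligned}
\min_{\theta}\, \mathbb{E}_{\mathbf{x} \sim D}\!\left[\text{KL}\!\left(\pi^{*}(\cdot|\mathbf{x}) \,\|\, \pi_{\theta}(\cdot|\mathbf{x})\right)\right]
&= \min_{\theta}\, \mathbb{E}_{\mathbf{x} \sim D}\, \mathbb{E}_{\mathbf{y} \sim \pi^{*}(\cdot|\mathbf{x})}\!\left[\log \pi^{*}(\mathbf{y}|\mathbf{x}) - \log \pi_{\theta}(\mathbf{y}|\mathbf{x})\right] \\
&\equiv \max_{\theta}\, \mathbb{E}_{\mathbf{x} \sim D}\, \mathbb{E}_{\mathbf{y} \sim \pi^{*}(\cdot|\mathbf{x})}\!\left[\log \pi_{\theta}(\mathbf{y}|\mathbf{x})\right],
\end{aligned}
$$

since $\log \pi^{*}(\mathbf{y}|\mathbf{x})$ does not depend on $\theta$. Under the assumed form of $\pi^{*}$, the inner expectation can be estimated by drawing candidate responses from $\pi_{\text{ref}}$ and weighting their log-likelihoods by self-normalized weights proportional to $\exp\!\left(r(\mathbf{x},\mathbf{y})/\beta\right)$, which yields a reward-weighted maximum-likelihood objective.

A minimal PyTorch sketch of that reward-weighted objective follows. The function name `reward_weighted_nll`, the `beta` temperature, and the assumption that the candidate responses were sampled from a reference policy are illustrative choices, not taken from the source text.

```python
import torch
import torch.nn.functional as F

def reward_weighted_nll(logits, targets, rewards, beta=1.0):
    """Reward-weighted negative log-likelihood (a sketch, not the book's code).

    Assumes the candidate responses in `targets` were sampled from a
    reference policy, so self-normalized weights softmax(rewards / beta)
    approximate sampling from pi* proportional to pi_ref * exp(r / beta).

    logits:  (num_candidates, seq_len, vocab_size) logits from pi_theta
    targets: (num_candidates, seq_len) token ids of candidate responses y
    rewards: (num_candidates,) scalar rewards r(x, y) for each candidate
    """
    # log pi_theta(y | x) per candidate: sum of per-token log-probabilities.
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    seq_lp = token_lp.sum(dim=-1)                    # (num_candidates,)

    # Self-normalized importance weights standing in for pi*.
    weights = torch.softmax(rewards / beta, dim=0)   # (num_candidates,)

    # Minimizing this loss maximizes E_{y ~ pi*}[log pi_theta(y | x)].
    return -(weights * seq_lp).sum()

# Toy usage: 4 candidate responses, 8 tokens each, vocabulary of 50.
logits = torch.randn(4, 8, 50, requires_grad=True)
targets = torch.randint(0, 50, (4, 8))
rewards = torch.tensor([1.2, -0.3, 0.7, 2.1])
loss = reward_weighted_nll(logits, targets, rewards, beta=1.0)
loss.backward()
```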
