Short Answer

Policy Optimization Derivation Step

A key insight in policy optimization is that minimizing a specific KL divergence is equivalent to maximizing a reward-based objective. Consider the expression for the KL divergence between a learned policy, π_θ(·|x), and an optimal policy, π*(·|x):

KL(π_θ(·|x) || π*(·|x))

Given the definition of the optimal policy as π*(y|x) = (1/Z(x)) * π_ref(y|x) * exp(r(x,y)/β), expand the KL divergence expression. Your final answer should be in terms of the learned policy π_θ, the reference policy π_ref, the reward r(x,y), the scaling factor β, and the normalization term Z(x). Do not drop any terms.
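A candidate expansion can be sanity-checked numerically. Below is a minimal sketch, assuming a finite action space for a fixed prompt x and randomly generated (hypothetical) policies and rewards: it compares a direct computation of KL(π_θ || π*) against the expanded form obtained by substituting the definition of π*.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5        # finite action (token) space for a fixed prompt x
beta = 0.5   # scaling factor β

# Hypothetical policies and rewards, for illustration only.
pi_theta = rng.dirichlet(np.ones(n))   # learned policy π_θ(·|x)
pi_ref = rng.dirichlet(np.ones(n))     # reference policy π_ref(·|x)
r = rng.normal(size=n)                 # reward r(x, y) for each action y

# Optimal policy: π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)
unnorm = pi_ref * np.exp(r / beta)
Z = unnorm.sum()                       # normalization term Z(x)
pi_star = unnorm / Z

# Direct KL divergence: Σ_y π_θ(y|x) · log(π_θ(y|x) / π*(y|x))
kl_direct = np.sum(pi_theta * np.log(pi_theta / pi_star))

# Expanded form: E_{y~π_θ}[log(π_θ/π_ref) − r(x,y)/β] + log Z(x)
kl_expanded = np.sum(pi_theta * (np.log(pi_theta / pi_ref) - r / beta)) + np.log(Z)

assert np.isclose(kl_direct, kl_expanded)
```

Note that the log Z(x) term is a constant with respect to y and factors out of the expectation, which is why it must not be dropped when expanding.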


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science