In a derivation showing that a policy optimization objective is equivalent to minimizing KL divergence, the objective is simplified from arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) - log Z(x) ] to arg min_θ E_{x~D} [ KL(π_θ(·|x) || π*(·|x)) ]. Why is it valid to remove the log Z(x) term during this final simplification?
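A minimal worked sketch of the step in question, assuming the standard closed form for the optimal policy, π*(y|x) = π_ref(y|x) exp(r(x,y)/β) / Z(x), with partition function Z(x) = Σ_y π_ref(y|x) exp(r(x,y)/β) (the reward r, coefficient β, and reference policy π_ref are not stated in the card and are assumed here):

\begin{aligned}
&\arg\min_\theta \; \mathbb{E}_{x\sim\mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big) - \log Z(x)\big] \\
&\quad= \arg\min_\theta \Big(\mathbb{E}_{x\sim\mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big)\big] \;-\; \underbrace{\mathbb{E}_{x\sim\mathcal{D}}\big[\log Z(x)\big]}_{\text{no dependence on }\theta}\Big) \\
&\quad= \arg\min_\theta \; \mathbb{E}_{x\sim\mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big)\big].
\end{aligned}

Because Z(x) is computed from the fixed reference policy and reward alone, E_{x~D}[log Z(x)] is a constant offset to the objective; subtracting a constant shifts the objective's value but not its minimizer, so the arg min is unchanged.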
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Simplified Policy Optimization Objective as KL Divergence Minimization
Policy Optimization Derivation Step
A policy optimization objective can be shown to be equivalent to minimizing a KL divergence. Arrange the following expressions to show the correct logical sequence of this mathematical derivation, starting from the point where the optimal policy has been substituted into the objective.
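A hedged sketch of the sequence that related card likely refers to, assuming the standard KL-regularized reward-maximization objective with reward r(x,y), coefficient β, and reference policy π_ref (none of which are given here), and the substitution π*(y|x) = π_ref(y|x) exp(r(x,y)/β) / Z(x):

\begin{aligned}
&\min_\theta \; \mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\!\left[\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} - \frac{1}{\beta}\,r(x,y)\right] \\
&\quad= \min_\theta \; \mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\!\left[\log\frac{\pi_\theta(y\mid x)}{\pi^*(y\mid x)} - \log Z(x)\right] \\
&\quad= \min_\theta \; \mathbb{E}_{x\sim\mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big) - \log Z(x)\big] \\
&\quad= \min_\theta \; \mathbb{E}_{x\sim\mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi^*(\cdot\mid x)\big)\big].
\end{aligned}

The second line follows by substituting π* and absorbing π_ref and exp(r/β) into it; the third line holds because log Z(x) does not depend on y; the last line drops the θ-independent term as above. The KL term is minimized at π_θ = π*, which is the sense in which the objective is equivalent to KL minimization.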