KL-Divergence Penalty in RLHF Policy Optimization
A penalty term is incorporated into the RLHF objective function to regularize the policy and prevent it from deviating excessively from a reference policy. This penalty is formulated as the difference between the log probabilities of a sequence under the current policy $\pi_\theta$ and the reference policy $\pi_{\text{ref}}$, summed over all tokens in the sequence. The formula is:

$$\text{penalty}(x, y) = \sum_{t=1}^{T} \left[ \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t}) \right]$$

where $x$ is the prompt, $y$ is the generated sequence of $T$ tokens, and $y_{<t}$ denotes the tokens generated before position $t$.
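As a minimal sketch of how this sum is computed in practice (assuming the per-token log probabilities of the sampled sequence have already been gathered from both models; the helper name `kl_penalty` is illustrative, not from any particular library):

```python
import torch

def kl_penalty(logprobs_policy: torch.Tensor,
               logprobs_ref: torch.Tensor) -> torch.Tensor:
    """Per-sequence penalty: sum over tokens of
    log pi_theta(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t).

    Both tensors have shape (batch, seq_len) and hold the log
    probability of each sampled token under each policy.
    """
    return (logprobs_policy - logprobs_ref).sum(dim=-1)

# Toy batch of one 3-token sequence.
lp_policy = torch.tensor([[-0.1, -0.5, -0.2]])
lp_ref    = torch.tensor([[-0.3, -0.4, -1.0]])
print(kl_penalty(lp_policy, lp_ref))  # tensor([0.9000])
```

In a full PPO loop this quantity is typically scaled by a coefficient and subtracted from the reward, so tokens the reference policy considers unlikely reduce the effective reward.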

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
KL-Divergence Penalty in RLHF Policy Optimization
A team is fine-tuning a language model with a single training objective: adjust the model's parameters to maximize the average score from a fixed reward model. After many training iterations, the team observes that while the policy consistently achieves high reward scores, the generated text is becoming repetitive and stylistically unnatural. What is the most likely reason for this outcome, based on the optimization objective?
Diagnosing Undesirable Model Behavior
Match each mathematical component from the policy learning objective function with its conceptual role in the training process.
RLHF Policy Optimization Objective
Policy Divergence Penalty for Language Models
KL-Divergence Penalty in RLHF Policy Optimization
An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?
Diagnosing and Mitigating Reward Hacking
Consequences of Omitting a Reference Policy in RLHF
Learn After
Overall PPO Objective Function for Language Models
During the policy optimization phase of training a large language model, the model is being rewarded for providing detailed explanations. The 'reference policy' is a version of the model that typically gives concise, direct answers. The current policy generates two possible responses to a user's query:
Response A: 'Yes.'
Response B: 'Affirmative, the data you have presented aligns with the expected parameters, and therefore, the conclusion you have reached is indeed correct and validated.'
Assuming the reference policy would have a very high probability of generating Response A and a near-zero probability of generating Response B, which response would incur a larger penalty term designed to prevent deviation from the reference policy, and why?
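For intuition only (the probabilities below are invented for illustration and are not part of the scenario), evaluating the penalty term from the definition above at the sequence level shows why a response the reference policy considers near-impossible is penalized far more heavily:

```python
import math

# Hypothetical sequence probabilities, chosen to match the scenario's
# qualitative description: the reference policy strongly favors the
# concise Response A and almost never produces Response B.
p_policy_A, p_ref_A = 0.20, 0.90
p_policy_B, p_ref_B = 0.70, 1e-6

penalty_A = math.log(p_policy_A) - math.log(p_ref_A)  # ~ -1.50
penalty_B = math.log(p_policy_B) - math.log(p_ref_B)  # ~ +13.46

# Response B incurs the far larger penalty, because log pi_ref is
# hugely negative for a sequence the reference almost never generates.
print(f"{penalty_A:.2f} {penalty_B:.2f}")
```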
Consequences of Policy Regularization Strength
Analysis of the Policy Regularization Penalty
Your team is running RLHF for a customer-facing LL...
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Diagnosing Instability in an RLHF + PPO Training Run
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO