Formula

KL-Divergence Penalty in RLHF Policy Optimization

A penalty term is incorporated into the RLHF objective function to regularize the policy and prevent it from deviating excessively from a reference policy. This penalty is formulated as the difference between the log probabilities of a sequence under the current policy (parameterized by $\theta$) and the reference policy (parameterized by $\theta_{\text{ref}}$), which decomposes into a sum over all tokens in the sequence:

$$\text{Penalty} = \log \Pr_{\theta}(y \mid x) - \log \Pr_{\theta_{\text{ref}}}(y \mid x) = \sum_{t=1}^{T} \log \Pr_{\theta}(y_t \mid x, y_{<t}) - \sum_{t=1}^{T} \log \Pr_{\theta_{\text{ref}}}(y_t \mid x, y_{<t})$$

When $y$ is sampled from the current policy, the expectation of this quantity is the KL divergence between the policy and the reference policy, which is why it acts as a KL penalty.
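As a minimal sketch of how this penalty can be computed from model outputs (assuming PyTorch; the tensor shapes and the names `sequence_logprob` and `kl_penalty` are illustrative, not from the source):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # logits: (batch, T, vocab) -- next-token scores at each position
    # tokens: (batch, T)        -- the generated sequence y
    logp = F.log_softmax(logits, dim=-1)                            # log Pr(. | x, y_<t)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # log Pr(y_t | x, y_<t)
    return token_logp.sum(dim=-1)                                   # sum over t = log Pr(y | x)

def kl_penalty(policy_logits: torch.Tensor,
               ref_logits: torch.Tensor,
               tokens: torch.Tensor) -> torch.Tensor:
    # Penalty = log Pr_theta(y | x) - log Pr_theta_ref(y | x)
    return sequence_logprob(policy_logits, tokens) - sequence_logprob(ref_logits, tokens)

# Toy usage with random tensors (shapes are hypothetical)
batch, T, vocab = 2, 5, 100
tokens = torch.randint(vocab, (batch, T))
penalty = kl_penalty(torch.randn(batch, T, vocab), torch.randn(batch, T, vocab), tokens)
print(penalty.shape)  # torch.Size([2]) -- one scalar penalty per sequence
```

In practice the reference model's parameters are frozen, so `ref_logits` would be computed under `torch.no_grad()`; only the current policy receives gradients through the penalty.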

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences