Formula

Policy Divergence Penalty for Language Models

The penalty term in PPO for language models quantifies the divergence between the current policy $\text{Pr}_{\theta}$ and a reference policy $\text{Pr}_{\theta_{\text{ref}}}$. It is defined as the difference in the log-probabilities of generating the response $\mathbf{y}$ given the prompt $\mathbf{x}$:

$$\text{Penalty} = \log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x}) - \log \text{Pr}_{\theta_{\text{ref}}}(\mathbf{y} \mid \mathbf{x})$$

Because an autoregressive language model factors the sequence probability into a product of next-token conditionals (the chain rule), this decomposes exactly into a sum over the tokens of the response:

$$\text{Penalty} = \sum_{t=1}^{T} \log \text{Pr}_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}) - \sum_{t=1}^{T} \log \text{Pr}_{\theta_{\text{ref}}}(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$$
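As a concrete illustration, here is a minimal sketch of computing this penalty for one prompt-plus-response token sequence. It assumes Hugging Face-style causal language models whose forward pass returns next-token logits; the names `policy`, `reference`, and `prompt_len` are illustrative, not from the source.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()  # gradients omitted for brevity; during PPO training the
                  # policy's log-probs would be computed with grads enabled
def response_logprob(model, input_ids, prompt_len):
    # input_ids: (1, L) tensor of prompt tokens followed by response tokens.
    logits = model(input_ids).logits                  # (1, L, vocab_size)
    # Logits at position t predict the token at position t + 1.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]                        # next-token targets
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens y_1..y_T (drop the prompt positions),
    # then sum: log Pr(y|x) = sum_t log Pr(y_t | x, y_<t).
    return token_logps[:, prompt_len - 1:].sum(dim=-1)

def divergence_penalty(policy, reference, input_ids, prompt_len):
    # Penalty = log Pr_theta(y | x) - log Pr_theta_ref(y | x)
    return (response_logprob(policy, input_ids, prompt_len)
            - response_logprob(reference, input_ids, prompt_len))
```

In practice this quantity is usually scaled by a coefficient and subtracted from the reward, so the policy is discouraged from drifting too far from the reference model during RLHF.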

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences