Policy Divergence Penalty for Language Models
The penalty term in PPO for language models quantifies the divergence between the current policy $\pi_\theta$ and a reference policy $\pi_{\text{ref}}$. It is defined as the difference in the log-probabilities of generating the response $y$ given the prompt $x$:

$$\text{penalty}(x, y) = \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)$$

For autoregressive language models, this can be decomposed exactly into a sum over the $T$ tokens in the sequence:

$$\text{penalty}(x, y) = \sum_{t=1}^{T} \left[ \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t}) \right]$$
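A minimal sketch of this token-level computation in PyTorch, assuming the per-token log-probabilities of the sampled response tokens have already been gathered from both models (the function name, tensor shapes, and example numbers are illustrative, not taken from this page):

```python
import torch

def policy_divergence_penalty(logp_current: torch.Tensor,
                              logp_ref: torch.Tensor) -> torch.Tensor:
    """Sequence-level penalty from per-token log-probabilities.

    Both arguments hold the log-probability each policy assigned to the
    tokens actually sampled in the response, shape (seq_len,).
    """
    # Summing the per-token log-ratios recovers the sequence-level
    # quantity log pi_theta(y|x) - log pi_ref(y|x) exactly.
    return (logp_current - logp_ref).sum()

# Illustrative numbers: a three-token response.
logp_current = torch.tensor([-1.2, -0.7, -2.3])
logp_ref = torch.tensor([-1.0, -1.1, -2.0])
print(policy_divergence_penalty(logp_current, logp_ref))  # tensor(-0.1000)
```

Note that the result can be negative: the penalty is a signed log-ratio for one sampled sequence, not a true (non-negative) KL divergence, which it only estimates in expectation.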

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Approximation of the Policy Divergence Penalty
Policy Divergence Penalty for Language Models
In a policy optimization process, a penalty is used to measure the change between a current policy, $\pi_\theta$, and a reference policy, $\pi_{\text{ref}}$. The penalty is calculated for a specific sequence of actions and states (a trajectory, $\tau$) using the formula:

$$\text{penalty}(\tau) = \log \pi_\theta(\tau) - \log \pi_{\text{ref}}(\tau)$$
If the calculated penalty for a particular trajectory is a large positive value, what is the most accurate interpretation?
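A one-step rearrangement (a sketch using the symbols reconstructed above) makes the sign of the penalty concrete:

$$\log \pi_\theta(\tau) - \log \pi_{\text{ref}}(\tau) > 0 \iff \pi_\theta(\tau) > \pi_{\text{ref}}(\tau),$$

so a large positive value indicates that the current policy assigns substantially higher probability to the trajectory than the reference policy does.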
Calculating Policy Divergence Penalty
Interpreting Policy Divergence
RLHF Policy Optimization Objective
KL-Divergence Penalty in RLHF Policy Optimization
An AI development team is fine-tuning a language model using a reinforcement learning process guided by a reward model. They observe that the model's outputs, while receiving high scores from the reward model, are becoming stylistically unnatural and deviating significantly from the helpful tone established during its initial supervised training. Which of the following adjustments to the training process is most specifically designed to counteract this behavioral drift?
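For context, the mechanism aimed at exactly this failure mode is a KL-style divergence penalty folded into the reward signal; a common formulation (the coefficient $\beta$ and the reward-model score $r_\phi$ are conventional notation, not taken from this page) is:

$$r_{\text{total}}(x, y) = r_\phi(x, y) - \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right],$$

which docks reward from generations that drift far from the reference (supervised) policy.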
Diagnosing and Mitigating Reward Hacking
Consequences of Omitting a Reference Policy in RLHF
Learn After
PPO Objective Formula for LLM Training in RLHF
An autoregressive language model is generating the two-token response 'Good day' given a prompt. The table below shows the per-token log-probabilities from the current policy being trained ($\pi_\theta$) and a fixed reference policy ($\pi_{\text{ref}}$). The policy divergence penalty is calculated as the sum of the differences between the log-probabilities of the current and reference policies for each token.

| Token | $\log \pi_\theta$ | $\log \pi_{\text{ref}}$ |
| :--- | :---: | :---: |
| 'Good' | -0.8 | -1.5 |
| 'day' | -0.4 | -2.1 |
Based on this data, what can be concluded about the current policy's behavior for this specific generation?
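Working the table values through the token-level decomposition (a worked check, not part of the original question):

$$(-0.8 - (-1.5)) + (-0.4 - (-2.1)) = 0.7 + 1.7 = 2.4,$$

a large positive penalty, meaning the current policy assigns considerably higher probability to both tokens than the reference policy does.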
Diagnosing Training Issues with Policy Divergence
Overall PPO Objective Function for Language Models
Interpreting the Policy Divergence Penalty