Formula

Re-weighting a Reference Probability Distribution with a Scaled Reward

The expression $\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)$ re-weights the probability distribution of a reference model $\pi_{\theta_{\text{ref}}}$. The term $\pi_{\theta_{\text{ref}}}(\mathbf{y}|\mathbf{x})$ is the base probability of generating output $\mathbf{y}$ from input $\mathbf{x}$ under the reference model parameterized by $\theta_{\text{ref}}$. This probability is multiplied by the exponential of the reward $r(\mathbf{x}, \mathbf{y})$, which is first scaled by the inverse temperature $\frac{1}{\beta}$. The temperature $\beta$ controls how strongly the reward influences the result: smaller values of $\beta$ amplify the effect of the reward, while larger values keep the result closer to the reference distribution.
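To make the role of $\beta$ concrete, the following is a minimal sketch in Python that applies the re-weighting to a small, made-up set of candidate outputs. The function name `reweight`, the three candidate probabilities, and the reward values are all hypothetical and chosen only for illustration. The expression as written is not guaranteed to sum to one over $\mathbf{y}$, so the sketch also returns a normalized version obtained by dividing by the sum of the weights; that normalization step is an addition for illustration and assumes a finite candidate set.

```python
import math


def reweight(ref_probs, rewards, beta):
    """Re-weight reference probabilities by exp(reward / beta).

    ref_probs: pi_ref(y|x) for each candidate y (assumed strictly positive).
    rewards:   r(x, y) for each candidate y.
    beta:      temperature (must be positive).

    Returns (unnormalized, normalized), where
      unnormalized[i] = ref_probs[i] * exp(rewards[i] / beta)
    and normalized is the same list divided by its sum.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if len(ref_probs) != len(rewards):
        raise ValueError("ref_probs and rewards must have the same length")

    # Log-space form of the weight: log pi_ref(y|x) + r(x, y) / beta.
    log_w = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]

    # Direct evaluation of the expression (may overflow for very large r / beta).
    unnormalized = [math.exp(lw) for lw in log_w]

    # Subtract the maximum before exponentiating so normalization stays stable.
    m = max(log_w)
    shifted = [math.exp(lw - m) for lw in log_w]
    total = sum(shifted)
    normalized = [s / total for s in shifted]
    return unnormalized, normalized


# Hypothetical toy values: three candidate outputs for one input.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

_, p_beta_1 = reweight(ref_probs, rewards, beta=1.0)
_, p_beta_half = reweight(ref_probs, rewards, beta=0.5)
_, p_beta_large = reweight(ref_probs, rewards, beta=100.0)

for name, probs in [("beta=1.0", p_beta_1),
                    ("beta=0.5", p_beta_half),
                    ("beta=100", p_beta_large)]:
    print(name, [round(p, 3) for p in probs])
```

In this toy setup, lowering $\beta$ from 1.0 to 0.5 moves more probability mass onto the highest-reward candidate, while a large $\beta$ leaves the normalized result close to the reference probabilities.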

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences