Learn Before
Optimal Policy as a Product of Reference Policy and Exponentiated Reward
In reinforcement learning, particularly for language model alignment, an optimal policy $\pi^*$ is defined to be proportional to the product of a reference policy $\pi_{\text{ref}}$ and the exponential of a scaled reward function $r(x, y)$. The unnormalized probability for an output $y$ given an input $x$ is given by the expression:

$$\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

This formulation effectively re-weights the probabilities from the reference policy, increasing the likelihood of outputs that receive a higher reward. The parameter $\beta$ acts as a temperature, controlling the influence of the reward function on the final policy. The full policy is obtained by normalizing this expression over all possible outputs, i.e., dividing by the partition function $Z(x) = \sum_{y'} \pi_{\text{ref}}(y' \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y')\right)$.
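As a concrete illustration, here is a minimal sketch of the re-weighting and normalization in Python. The candidate outputs, their reference probabilities, and the reward values are hypothetical placeholders, not from any particular model or library:

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Re-weight reference-policy probabilities by exp(r / beta) and normalize.

    ref_probs: dict mapping candidate output -> pi_ref(y | x)
    rewards:   dict mapping candidate output -> r(x, y)
    beta:      temperature; a smaller beta gives the reward more influence
    """
    # Unnormalized weights: pi_ref(y | x) * exp(r(x, y) / beta)
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in ref_probs.items()}
    # Partition function Z(x) turns the weights into a proper distribution
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

# Illustrative example: two candidates with equal reference probability
ref_probs = {"A": 0.5, "B": 0.5}
rewards = {"A": 3.0, "B": 1.0}
print(optimal_policy(ref_probs, rewards, beta=1.0))
# The higher-reward output "A" receives most of the probability mass
```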

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Plackett-Luce Selection Probability Formula
Worth Function in Plackett-Luce Model
A language model's policy, which determines the probability of generating an output $y$ given an input $x$, is structured to be proportional to the exponential of a reward score $r(x, y)$. For a specific input, two potential outputs have the following reward scores:
- Output A: Reward = 3.0
- Output B: Reward = 1.0
Based on this formulation, how does the probability of generating Output A compare to the probability of generating Output B?
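One way to work out the comparison: under this proportionality, the normalizer cancels in the ratio, so only the difference of the reward scores matters:

$$\frac{P(\text{Output A})}{P(\text{Output B})} = \frac{e^{3.0}}{e^{1.0}} = e^{2.0} \approx 7.39$$

Output A is therefore about 7.4 times more likely than Output B, not 3 times (the ratio of the raw scores).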
Analyzing Language Model Response Probabilities
A language model's policy is designed such that the probability of generating an output is proportional to the exponential of its reward score. If Output Y has a reward score exactly double that of Output Z, the policy does not simply assign double the probability to Output Y: because probabilities scale with $e^{r}$, the ratio of the two probabilities is $e^{r_Y - r_Z}$, which depends on the difference of the scores rather than their ratio.
Learn After
A team is refining a text-generation model. The final probability of a generated text sequence is proportional to the product of its probability from an initial base model and an exponentiated reward score. The reward's influence is controlled by a scaling parameter, β, in the exponent, where a smaller β gives the reward more weight. The team observes that when they significantly decrease the value of β, the model's outputs become more repetitive and sometimes nonsensical, even though they achieve very high scores from the reward model. Which of the following best explains this behavior?
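To see the effect described in this scenario, one can reuse the same re-weighting with progressively smaller values of β. The probabilities and reward values below are illustrative placeholders:

```python
import math

ref_probs = {"A": 0.50, "B": 0.45, "C": 0.05}   # base-model probabilities (illustrative)
rewards = {"A": 1.0, "B": 0.9, "C": 2.5}        # reward model strongly favors "C"

for beta in (10.0, 1.0, 0.1):
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in ref_probs.items()}
    z = sum(weights.values())
    policy = {y: round(w / z, 3) for y, w in weights.items()}
    print(f"beta={beta}: {policy}")
# As beta shrinks, the policy collapses onto the highest-reward output "C",
# even though the base model considers it unlikely: the reward term dominates
# and the reference policy's regularizing influence vanishes.
```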
Diagnosing Language Model Alignment Issues
Justifying the Reference Policy