Re-weighting a Reference Probability Distribution with a Scaled Reward
The formula represents a method for adjusting a probability distribution from a reference model, denoted by π_θ: π(y|x) ∝ π_θ(y|x) * exp((1/β) * r(x, y)). The term π_θ(y|x) is the base probability of generating output y from input x according to the reference model parameterized by θ. This probability is then scaled by the exponential of a reward function r(x, y), which is itself scaled by an inverse temperature parameter, 1/β. The temperature β controls the extent to which the reward influences the final probability, with smaller values of β amplifying the effect of the reward.
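A minimal sketch of this re-weighting in Python, assuming a small discrete set of candidate outputs; the function name reweight and the probability and reward values below are illustrative, not taken from the source:

```python
import math

def reweight(ref_probs, rewards, beta):
    """Tilt reference probabilities by exp(r / beta) and renormalize.

    ref_probs: base probabilities pi_theta(y|x) for each candidate y
    rewards:   reward r(x, y) for each candidate y
    beta:      temperature; smaller beta lets the reward dominate more
    """
    # Unnormalized scores: pi_theta(y|x) * exp((1/beta) * r(x, y))
    scores = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(scores)
    return [s / total for s in scores]

# Illustrative base probabilities and rewards for four candidate outputs
ref_probs = [0.40, 0.30, 0.20, 0.10]
rewards = [0.0, 1.0, 2.0, 3.0]

for beta in (0.5, 1.0, 2.0):
    print(beta, [round(p, 3) for p in reweight(ref_probs, rewards, beta)])
```

Printing the result for several values of β shows the smaller-β distributions concentrating probability mass on the high-reward candidates, while larger β stays closer to the reference distribution.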

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
A language model is generating a completion for an input x. The model has a base probability distribution, π(y|x), over four potential completions y. To steer the model's output, a reward function, r(x, y), is applied to create a new unnormalized score for each completion using the formula Score(y) = π(y|x) * exp(r(x, y)). Given the values below, which completion will have the highest score? (A worked illustration of this scoring formula follows this list.)
True or false: when using the formula Score(y) = π(y|x) * exp(r(x, y)) to adjust the likelihood of a potential output y, setting the reward r(x, y) to zero will cause the final score for that output to become zero, effectively eliminating it from consideration.
Steering Language Model Output for Slogan Generation
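As a worked illustration of the Score(y) = π(y|x) * exp(r(x, y)) formula used in the two questions above; the candidate probabilities and rewards here are hypothetical, since the actual values the first question refers to are not included in this export:

```python
import math

def score(base_prob, reward):
    # Score(y) = pi(y|x) * exp(r(x, y))
    return base_prob * math.exp(reward)

# Hypothetical candidates: (base probability, reward). The highest combined
# score need not belong to the candidate with the highest base probability.
candidates = {"y1": (0.50, 0.0), "y2": (0.30, 1.0), "y3": (0.15, 2.0), "y4": (0.05, 3.0)}
for y, (p, r) in candidates.items():
    print(y, round(score(p, r), 3))

# Note that exp(0) = 1, so a zero reward leaves the base probability
# unchanged rather than zeroing the score out.
```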
Learn After
An AI text generation system adjusts the likelihood of different outputs using the formula: New_Likelihood = Base_Likelihood * exp((1/β) * Reward). In this formula, 'Base_Likelihood' is the initial probability from a reference model, 'Reward' is a score for the output's quality, and 'β' is a positive 'temperature' parameter. A team wants to use this system to generate a diverse set of creative, high-quality story endings. They are comparing two settings for the temperature parameter: β = 0.5 and β = 2.0. Which setting should they choose to better achieve their goal, and why? (A small numeric sketch comparing the two settings follows this list.)
Tuning a Generative Model for Different Tasks
Effect of Temperature Scaling on a Reward-Modified Distribution
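A small numeric sketch of the temperature comparison above, assuming a uniform base distribution and hypothetical reward scores; Shannon entropy is used here as a rough proxy for diversity:

```python
import math

def tilted(ref_probs, rewards, beta):
    # New_Likelihood is proportional to Base_Likelihood * exp((1/beta) * Reward)
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    # Shannon entropy in bits; higher means a more diverse distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

ref_probs = [0.25, 0.25, 0.25, 0.25]  # hypothetical uniform base distribution
rewards = [0.2, 0.9, 1.0, 0.8]        # hypothetical quality scores

for beta in (0.5, 2.0):
    probs = tilted(ref_probs, rewards, beta)
    print(f"beta={beta}: {[round(p, 3) for p in probs]} entropy={entropy(probs):.3f} bits")
```

Under these assumed values, β = 0.5 produces a sharper distribution (lower entropy) that concentrates on the highest-reward endings, while β = 2.0 retains more probability mass across candidates, which is the relevant trade-off for the question's diversity goal.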