When using the formula Score(y) = π(y|x) * exp(r(x, y)) to adjust the likelihood of a potential output y, setting the reward r(x, y) to zero makes the exponential factor exp(0) = 1, so the final score for that output simply equals its base probability π(y|x); the output is not eliminated from consideration.
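A minimal sketch of this re-weighting, using made-up base probabilities and rewards for four hypothetical completions, showing that a zero reward leaves the score equal to the base probability:

```python
import math

# Hypothetical base probabilities pi(y|x) and rewards r(x, y) for four
# candidate completions; all values here are invented for illustration.
base_probs = {"y1": 0.40, "y2": 0.30, "y3": 0.20, "y4": 0.10}
rewards    = {"y1": 0.0,  "y2": 1.0,  "y3": 2.0,  "y4": -1.0}

# Score(y) = pi(y|x) * exp(r(x, y))
scores = {y: base_probs[y] * math.exp(rewards[y]) for y in base_probs}

# With r = 0, exp(0) = 1, so the score equals the base probability.
assert scores["y1"] == base_probs["y1"]

best = max(scores, key=scores.get)
print(best)  # → y3
```

Note that a large enough reward (here r = 2 for y3) can outweigh a higher base probability (y1's 0.40), which is exactly how the reward steers the distribution.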
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Re-weighting a Reference Probability Distribution with a Scaled Reward
A language model is generating a completion for an input x. The model has a base probability distribution, π(y|x), for four potential completions (y). To steer the model's output, a reward function, r(x, y), is applied to create a new unnormalized score for each completion using the formula: Score(y) = π(y|x) * exp(r(x, y)). Given the values below, which completion will have the highest score?
Steering Language Model Output for Slogan Generation