Policy Proportional to Exponentiated Reward
A policy, denoted as π(y | x), can be modeled such that the probability of generating an output y given an input x is proportional to the exponential of a reward function r(x, y). This fundamental relationship is expressed as:

π(y | x) ∝ exp(r(x, y))

This formulation ensures that outputs with higher reward scores are assigned exponentially higher probabilities, forming the basis for converting learned rewards into a usable probability distribution, which is then typically normalized.
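As a minimal sketch of how the proportionality becomes a normalized distribution, the exponentiated rewards can be divided by their sum (a softmax over a candidate set); the reward values below are hypothetical:

```python
import math

def policy_probs(rewards):
    """Turn reward scores into probabilities where
    P(y | x) is proportional to exp(r(x, y))."""
    worths = [math.exp(r) for r in rewards]  # exponentiate each reward
    total = sum(worths)                      # normalization constant
    return [w / total for w in worths]

# Hypothetical reward scores for three candidate outputs
probs = policy_probs([3.0, 1.0, 0.0])
```

Because exp is strictly increasing, the ordering of probabilities always matches the ordering of rewards, and the normalization makes the values sum to one.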
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Proportional to Exponentiated Reward
A system for ranking text responses first assigns a numerical reward score to each response, and then calculates a 'worth' value for each response using the formula: worth = exp(reward score). Consider two scenarios:
Scenario 1: Response A has a reward score of 3.0, and Response B has a reward score of 1.0. Scenario 2: Response C has a reward score of 8.0, and Response D has a reward score of 6.0.
How does the ratio of worths (Worth_A / Worth_B) in Scenario 1 compare to the ratio of worths (Worth_C / Worth_D) in Scenario 2?
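A quick numeric check of the two scenarios, using the scores stated above:

```python
import math

# Scenario 1: Worth_A / Worth_B
ratio_1 = math.exp(3.0) / math.exp(1.0)
# Scenario 2: Worth_C / Worth_D
ratio_2 = math.exp(8.0) / math.exp(6.0)

# Since exp(a) / exp(b) = exp(a - b), both ratios equal exp(2.0):
# only the score difference (2.0 in both scenarios) matters.
```

This is the key property of the exponential worth function: ratios of worths depend only on differences in reward scores.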
A system for modeling human preferences assigns a numerical reward score, r, to a given text response. This score can be positive, negative, or zero. To use these scores in a specific type of ranking probability model, each score r must be converted into a 'worth' value α that is always positive and strictly increases as r increases. A researcher proposes using the function α = r² + 0.1 for this conversion. Which statement correctly analyzes the suitability of this proposed function?

A system models preferences by first assigning a numerical reward score to a response and then converting it to a 'worth' value using the formula: worth = exp(reward_score). An engineer improves a response, causing its reward score to increase first from 2.0 to 3.0, and then with a further improvement, from 3.0 to 4.0. How does the increase in the response's 'worth' value during the first improvement compare to the increase during the second improvement?
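Both conversion questions above can be probed numerically; this is a sketch using the formulas as stated:

```python
import math

# Proposed conversion alpha = r**2 + 0.1: always positive, but NOT
# strictly increasing in r -- e.g. raising r from -2.0 to -1.0 lowers alpha.
alpha = lambda r: r**2 + 0.1
decreasing_for_negative_r = alpha(-2.0) > alpha(-1.0)  # 4.1 > 1.1

# worth = exp(reward_score): equal score increments give growing worth
# increments, because exp is convex.
first_gain = math.exp(3.0) - math.exp(2.0)   # score 2.0 -> 3.0
second_gain = math.exp(4.0) - math.exp(3.0)  # score 3.0 -> 4.0
# second_gain / first_gain = e, so the second increase is e times larger.
```

The first check shows why r² + 0.1 fails the strict-monotonicity requirement for negative scores; the second shows that exp amplifies later improvements more than earlier ones of the same size.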
Learn After
Plackett-Luce Selection Probability Formula
Optimal Policy as a Product of Reference Policy and Exponentiated Reward
Worth Function in Plackett-Luce Model
A language model's policy, which determines the probability of generating an output y given an input x, is structured to be proportional to the exponential of a reward score r(x, y). For a specific input, two potential outputs have the following reward scores:

- Output A: Reward = 3.0
- Output B: Reward = 1.0
Based on this formulation, how does the probability of generating Output A compare to the probability of generating Output B?
Analyzing Language Model Response Probabilities
A language model's policy is designed such that the probability of generating an output is proportional to the exponential of its reward score. If Output Y has a reward score that is exactly double the reward score of Output Z, it means the policy will assign exactly double the probability to Output Y compared to Output Z.
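Under an exp-proportional policy, the probability ratio between two outputs is the exponential of the score difference, not the ratio of the scores themselves; a small sketch with hypothetical scores where Y's reward is double Z's:

```python
import math

r_y, r_z = 4.0, 2.0  # hypothetical: Y's score is exactly double Z's
# The normalization constant cancels in the ratio P(Y) / P(Z):
ratio = math.exp(r_y) / math.exp(r_z)
# ratio = exp(r_y - r_z) = exp(2.0), roughly 7.39 -- not 2.0
```

Doubling a reward score therefore does not double the probability; the probability ratio grows exponentially with the score gap.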