Learn Before
Worth Function in Plackett-Luce Model
In the Plackett-Luce model, a 'worth' value, written here as worth(y), is assigned to each possible response y. This value is defined as the exponential of a reward score r(x, y), which is associated with generating response y given an input x. The formula is:
worth(y) = exp(r(x, y))
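As a minimal sketch of this definition, the snippet below computes the worth of a few responses from their reward scores. The function name `worth`, the response labels, and the reward values are all hypothetical, chosen only for illustration.

```python
import math


def worth(reward: float) -> float:
    """Worth of a response: the exponential of its reward score r(x, y)."""
    return math.exp(reward)


# Hypothetical reward scores for three candidate responses to one input.
rewards = {"response_1": 0.0, "response_2": 1.0, "response_3": -2.0}

# Apply the worth function to every candidate.
worths = {name: worth(r) for name, r in rewards.items()}

for name, w in worths.items():
    print(f"{name}: reward={rewards[name]:+.1f}, worth={w:.4f}")
```

Because the exponential is always positive, every response receives a strictly positive worth, even when its reward score is negative.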
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Plackett-Luce Selection Probability Formula
Optimal Policy as a Product of Reference Policy and Exponentiated Reward
Worth Function in Plackett-Luce Model
A language model's policy, which determines the probability of generating an output y given an input x, is structured to be proportional to the exponential of a reward score r(x, y). For a specific input, two potential outputs have the following reward scores:
- Output A: Reward = 3.0
- Output B: Reward = 1.0
Based on this formulation, how does the probability of generating Output A compare to the probability of generating Output B?
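One way to check the comparison numerically is sketched below. It assumes, purely for illustration, that Output A and Output B are the only two candidates when normalizing. The ratio itself does not depend on that assumption, because the shared normalizer cancels.

```python
import math

reward_a = 3.0  # Output A
reward_b = 1.0  # Output B

# The policy is proportional to exp(reward), so the ratio of the two
# probabilities equals exp(reward_a) / exp(reward_b) = exp(reward_a - reward_b).
ratio = math.exp(reward_a) / math.exp(reward_b)
print(f"P(A) / P(B) = {ratio:.3f}")  # e^2, about 7.389

# Assumption: only these two outputs exist, so they share this normalizer.
normalizer = math.exp(reward_a) + math.exp(reward_b)
p_a = math.exp(reward_a) / normalizer
p_b = math.exp(reward_b) / normalizer
print(f"P(A) = {p_a:.4f}, P(B) = {p_b:.4f}")
```

The ratio depends only on the difference between the two reward scores (here 2.0), not on their individual values.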
Analyzing Language Model Response Probabilities
A language model's policy is designed such that the probability of generating an output is proportional to the exponential of its reward score. If Output Y has a reward score that is exactly double the reward score of Output Z, it means the policy will assign exactly double the probability to Output Y compared to Output Z.
Learn After
Plackett-Luce Selection Probability Formula
A system assigns a 'worth' value to potential text completions, calculated as the exponential of a reward score. Initially, three completions (A, B, C) have reward scores of 2.0, 3.0, and 4.0, respectively. If the reward score for each completion is increased by a constant value of 1.0, how does this change affect the ratio of worth between any two completions (e.g., the ratio of worth(B) to worth(A))?
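The effect of a constant shift can be checked with a short sketch like the one below, which uses the reward scores from the question. The helper name `worth` is hypothetical.

```python
import math


def worth(reward: float) -> float:
    """Worth is the exponential of the reward score."""
    return math.exp(reward)


rewards = {"A": 2.0, "B": 3.0, "C": 4.0}
shift = 1.0
shifted = {name: r + shift for name, r in rewards.items()}

# Ratio of worth(B) to worth(A) before and after adding the constant.
ratio_before = worth(rewards["B"]) / worth(rewards["A"])
ratio_after = worth(shifted["B"]) / worth(shifted["A"])

print(f"before shift: {ratio_before:.6f}")
print(f"after shift:  {ratio_after:.6f}")
print("ratios match:", math.isclose(ratio_before, ratio_after))
```

Adding a constant c multiplies every worth by the same factor exp(c), which cancels in any ratio of two worth values.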
Calculating Response Worth for an AI Assistant
In a system that assigns a 'worth' value to a response by taking the exponential of its reward score, doubling the reward score for a response will also double its assigned worth value.
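A quick numerical check of this statement might look like the following sketch. The reward value 1.5 and the helper name `worth` are arbitrary choices for illustration.

```python
import math


def worth(reward: float) -> float:
    """Worth is the exponential of the reward score."""
    return math.exp(reward)


reward = 1.5  # hypothetical reward score
original = worth(reward)
doubled = worth(2 * reward)

print(f"worth(r)  = {original:.4f}")
print(f"worth(2r) = {doubled:.4f}")
print(f"worth(2r) / worth(r) = {doubled / original:.4f}")

# exp(2r) = exp(r) ** 2, so doubling the reward squares the worth.
print("worth(2r) == worth(r)^2:", math.isclose(doubled, original ** 2))
```

Since exp(2r) = exp(r)^2, the ratio worth(2r) / worth(r) equals exp(r). It is exactly 2 only when r = ln 2.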