Learn Before
Target Policy as a Reward-Weighted Distribution
In policy optimization frameworks like RLHF, the target policy π_θ being learned is defined to equal an optimal distribution π*. This optimal distribution is obtained by re-weighting a reference policy π_{θ_ref} according to a reward function r(x, y). The complete relationship is expressed by the formula:

π*(y|x) = (1/Z(x)) · π_{θ_ref}(y|x) · exp(r(x, y)/β)

where Z(x) = Σ_y π_{θ_ref}(y|x) · exp(r(x, y)/β) is a normalizing constant (the partition function) and β is a positive scalar. This equation establishes the ideal policy that the model, parameterized by θ, aims to learn, balancing adherence to the reference model with maximization of the reward.
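As a minimal numerical sketch of this re-weighting (a toy set of four candidate outputs with hand-picked reference probabilities, rewards, and β; the names `pi_ref`, `rewards`, and `target_policy` and all values are illustrative, not from the source), the following Python snippet computes π* directly from the formula:

```python
import numpy as np

# Toy reference policy pi_ref(y|x) over four candidate outputs for one prompt x,
# with an illustrative reward r(x, y) for each output.
pi_ref = np.array([0.40, 0.30, 0.20, 0.10])
rewards = np.array([0.5, 1.0, 2.0, 0.1])
beta = 1.0  # positive scalar controlling how strongly the reward reshapes pi_ref

def target_policy(pi_ref, rewards, beta):
    """pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)."""
    weights = pi_ref * np.exp(rewards / beta)
    return weights / weights.sum()  # dividing by Z(x), the partition function

pi_star = target_policy(pi_ref, rewards, beta)
print(np.round(pi_star, 4))  # high-reward outputs gain probability mass relative to pi_ref
```

Because Z(x) is simply the sum of the re-weighted terms, π* remains a proper probability distribution over the candidate outputs.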

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
In the formula for a reward-weighted probability distribution, the parameter β acts as a temperature or inverse scaling factor. How does decreasing the value of β (i.e., moving it closer to 0, but remaining positive) affect the final distribution π*? (A numerical sketch after these related items illustrates the effect.)
Applying a Reward Function to a Language Model's Output
Target Policy as a Reward-Weighted Distribution
In the context of a reward-weighted probability distribution, defined as π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x, y)/β), consider a scenario where a specific output, y′, receives a very high reward, r(x, y′). However, the reference distribution assigns a probability to this output that is extremely close to zero, i.e., π_ref(y′|x) ≈ 0. What will be the approximate probability of y′ in the final distribution, π*(y′|x)?
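Both questions above can be explored numerically. The sketch below (the same toy setup and function name as before; all values are illustrative assumptions) sweeps β to show that small β sharpens π* toward high-reward outputs while large β keeps it close to π_ref, and then uses an output with zero reference probability to show that the multiplicative re-weighting gives it no mass regardless of its reward:

```python
import numpy as np

def target_policy(pi_ref, rewards, beta):
    """pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)."""
    weights = pi_ref * np.exp(rewards / beta)
    return weights / weights.sum()

pi_ref = np.array([0.5, 0.3, 0.2])
rewards = np.array([0.2, 1.0, 2.0])

# Decreasing beta makes exp(r/beta) dominate, so pi* concentrates on high-reward
# outputs; increasing beta pushes pi* back toward the reference distribution.
for beta in (10.0, 1.0, 0.1):
    print(f"beta={beta:>4}: {np.round(target_policy(pi_ref, rewards, beta), 4)}")

# An output the reference policy never produces stays at zero probability in pi*,
# even with a huge reward, because pi* is multiplicative in pi_ref.
pi_ref_2 = np.array([0.6, 0.4, 0.0])
rewards_2 = np.array([0.2, 1.0, 50.0])
print(np.round(target_policy(pi_ref_2, rewards_2, beta=1.0), 4))  # last entry is 0
```

In practice π_ref is rarely exactly zero, but the same logic keeps outputs with vanishingly small reference probability negligible in π* unless exp(r/β) becomes comparably enormous.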
Learn After
Derivation of the KL Divergence Objective for Policy Optimization
A language model's behavior is guided by a target probability distribution, π*, which is defined by re-weighting a reference distribution, π_ref, based on a reward score, r(x, y). The relationship is given by the formula: π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x, y)/β), where Z(x) is a normalizing constant. In this formula, β is a positive scalar parameter. Analyze the effect of significantly increasing the value of β. What is the most direct consequence for the target distribution π*?
Critique of a Modified Policy Formulation
Calculating a Target Policy Distribution