Learn Before
In the formula for a reward-weighted probability distribution, the parameter β acts as a temperature or inverse scaling factor. How does decreasing the value of β (i.e., moving it closer to 0, but remaining positive) affect the final distribution π*?
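The effect of β can be sketched numerically. A minimal example in Python, assuming the standard form π*(y|x) ∝ π_ref(y|x) · exp(r(x,y)/β); the three-outcome reference distribution and rewards below are made up for illustration:

```python
import numpy as np

# Hypothetical reference distribution over three outputs and their rewards.
pi_ref = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 2.0, 3.0])

def target_policy(beta):
    """Reward-weighted distribution: normalize pi_ref * exp(rewards / beta)."""
    weights = pi_ref * np.exp(rewards / beta)
    return weights / weights.sum()

high_beta = target_policy(beta=1.0)  # mild reweighting of pi_ref
low_beta = target_policy(beta=0.1)   # sharply peaked on the top-reward output

# Decreasing beta concentrates probability mass on the highest-reward output.
assert low_beta[2] > high_beta[2]
```

As β → 0⁺, exp(r/β) is dominated by the largest reward, so π* approaches a point mass on the highest-reward output; as β grows, π* relaxes back toward π_ref.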
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Applying a Reward Function to a Language Model's Output
Target Policy as a Reward-Weighted Distribution
In the context of a reward-weighted probability distribution, defined as π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β), consider a scenario where a specific output, y₀, receives a very high reward, r(x, y₀). However, the reference distribution assigns a probability to this output that is extremely close to zero, i.e., π_ref(y₀|x) ≈ 0. What will be the approximate probability of y₀ in the final distribution, π*(y₀|x)?
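A numeric sketch of this scenario, under the same assumed form π*(y|x) ∝ π_ref(y|x) · exp(r(x,y)/β) with made-up numbers: because π* is a product, a near-zero reference probability suppresses an output even when its reward is the largest, for any fixed, moderate β.

```python
import numpy as np

# Made-up example: the third output has by far the highest reward,
# but the reference model assigns it probability ~1e-12.
pi_ref = np.array([0.5, 0.5 - 1e-12, 1e-12])
rewards = np.array([1.0, 1.0, 10.0])
beta = 1.0

weights = pi_ref * np.exp(rewards / beta)
pi_star = weights / weights.sum()

# exp(10) ~ 2.2e4 cannot rescue a 1e-12 reference probability:
# pi_star for the high-reward output stays close to zero.
assert pi_star[2] < 1e-6
```

The near-zero factor from π_ref dominates the product, so π*(y₀|x) ≈ 0 unless β is made so small that exp(r/β) overwhelms the vanishing reference probability.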