Learn Before
In a policy optimization algorithm, a ratio comparing the likelihood of an action under a new policy versus an old policy is constrained to stay within the interval [1-ε, 1+ε]. What is the most likely consequence of setting the parameter ε to a very small value (e.g., 0.01)?
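To make the constraint concrete, here is a minimal Python sketch of how the width of the interval [1-ε, 1+ε] limits the ratio. The helper name clip_ratio and the sample ratios are illustrative assumptions, not part of the course material.

```python
def clip_ratio(ratio: float, epsilon: float) -> float:
    """Clamp a probability ratio to the interval [1 - epsilon, 1 + epsilon].

    Hypothetical helper for illustration only.
    """
    return max(1.0 - epsilon, min(ratio, 1.0 + epsilon))


# Sample ratios pi_new(a|s) / pi_old(a|s); the values are made up for the demo.
sample_ratios = [0.5, 0.9, 1.0, 1.1, 1.5]

for epsilon in (0.2, 0.01):
    clipped = [clip_ratio(r, epsilon) for r in sample_ratios]
    print(f"epsilon={epsilon}: {clipped}")
```

With a smaller ε, every sample ratio is pulled much closer to 1, so the clipped ratio can differ only slightly from the value it takes when the new and old policies agree.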
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Upper-Bound Clipping Function for Policy Ratios
A policy optimization algorithm uses a bounding function, bound(value, lower_bound, upper_bound), to constrain a ratio of action probabilities. This function clips the value to ensure it stays within the interval [lower_bound, upper_bound]. If the ratio value is 1.5, and the interval is defined by a parameter ε = 0.2 (i.e., the interval is [1 - 0.2, 1 + 0.2]), what is the resulting value after the bounding operation is applied?
Applying a Bounding Constraint on Probability Ratios
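The bound(value, lower_bound, upper_bound) function named in the related question can be sketched in Python as follows. The signature is taken from the question text; the body is one plausible implementation using the built-in min and max, offered as an assumption rather than the course's own definition.

```python
def bound(value: float, lower_bound: float, upper_bound: float) -> float:
    """Clip value so that it lies within [lower_bound, upper_bound].

    Assumed implementation; the source only gives the signature.
    """
    return max(lower_bound, min(value, upper_bound))


# Numbers from the related question: ratio 1.5 and epsilon = 0.2.
epsilon = 0.2
clipped = bound(1.5, 1 - epsilon, 1 + epsilon)
print(clipped)
```

Because 1.5 lies above the upper bound 1 + ε, the function returns the upper bound itself; a ratio already inside the interval would pass through unchanged.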