Short Answer

Applying a Bounding Constraint on Probability Ratios

In a reinforcement learning algorithm, a ratio comparing the probability of an action under a new policy to an old policy is constrained to stay within a specific interval to ensure training stability. This interval is defined as [1 - ε, 1 + ε]. If the constraint parameter ε is set to 0.25, what would be the final constrained values for the following two independently calculated ratios?

  1. Initial Ratio: 1.40
  2. Initial Ratio: 0.65

Provide the final value for each case.

0

1

Updated 2025-10-08

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science