Learn Before
Policy Probability Ratio (Ratio Function)
The policy probability ratio, also known as the ratio function, compares a current policy ($\pi_{\theta}$) with a previous or reference policy ($\pi_{\theta_{\text{ref}}}$) for a given state-action pair $(s, a)$. It is computed by dividing the probability of the action under the current policy by its probability under the reference policy. The ratio makes it possible to reweight observed rewards according to how likely the actions are under the current policy relative to the reference policy. The mathematical formula is: $\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{ref}}}(a \mid s)}$.
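As an illustration of the definition above, the following is a minimal Python sketch, not a reference implementation. The function names `policy_ratio` and `reweighted_reward` and the example probabilities are hypothetical, and multiplying the reward by the ratio is only one simple way to read "reweighting"; actual training objectives may apply the ratio to other quantities.

```python
def policy_ratio(p_current: float, p_reference: float) -> float:
    """Probability of an action under the current policy divided by
    its probability under the reference policy (same state, same action)."""
    if p_reference <= 0.0:
        # The ratio is undefined if the reference policy never takes the action.
        raise ValueError("reference probability must be positive")
    return p_current / p_reference


def reweighted_reward(reward: float, p_current: float, p_reference: float) -> float:
    """Scale an observed reward by the policy probability ratio
    (an illustrative assumption about how the reweighting is applied)."""
    return policy_ratio(p_current, p_reference) * reward


# Hypothetical probabilities for one state-action pair.
ratio_up = policy_ratio(0.5, 0.25)      # current policy favors the action more
ratio_down = policy_ratio(0.125, 0.5)   # current policy favors the action less
weighted = reweighted_reward(3.0, 0.5, 0.25)

print(ratio_up, ratio_down, weighted)
```

A ratio above 1 indicates that the action has become more likely under the current policy than under the reference policy, and a ratio below 1 indicates that it has become less likely.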

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reference Policy
Policy Probability Ratio (Ratio Function)
An autonomous agent is being trained to navigate a maze. The agent's decision-making process at any given intersection (a 'state') is determined by a specific component of its programming. Which of the following scenarios best exemplifies this decision-making component?
An autonomous agent is programmed to navigate a grid. When it reaches a specific grid cell (state 'S'), it must choose an action. Consider two different versions of the agent's programming:
- Agent 1: When in state 'S', it is programmed to always choose the action 'move North'.
- Agent 2: When in state 'S', it is programmed to choose 'move North' with 70% probability and 'move East' with 30% probability.
Which statement best analyzes the difference in how these two agents map states to actions?
An agent's goal is to navigate a simple environment and maximize its total reward. The agent is currently in a state 'S'. From this state, it can take one of two actions: 'Action 1' which consistently leads to a reward of +10, or 'Action 2' which consistently leads to a reward of -5. Consider two possible behavior patterns for the agent when it is in state 'S':
- Behavior A: The agent chooses 'Action 1' with a 100% probability.
- Behavior B: The agent chooses 'Action 1' with a 50% probability and 'Action 2' with a 50% probability.
Which behavior pattern is superior for achieving the agent's goal, and why?
Learn After
Increased Action Probability Condition
Policy Probability Ratio Less Than One
Bound Function for Policy Probability Ratio
Policy Probability Ratio Greater Than One
Upper-Bound Clipping Function for Policy Ratios
Evaluating a Policy Change
In an off-policy reinforcement learning scenario, an agent is in a specific state. The policy that originally collected the training data (the reference policy) selected a particular action with a probability of 0.2. The agent's current, updated policy would select that same action with a probability of 0.8. What does the resulting probability ratio imply about how the reward for this state-action pair should be treated during the policy update?
Interpreting Policy Changes