Learn Before
Reference Policy (π_ref)
A reference policy, denoted π_ref, is a fixed policy used as a baseline in certain reinforcement learning algorithms. It is typically an earlier version of the policy being trained or a pre-trained model. Training aims to improve the current policy (π_θ) without letting it deviate too far from the reference policy, which helps stabilize learning.
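A common way to enforce this constraint is to add a penalty proportional to the KL divergence between π_θ and π_ref to the training objective, i.e., maximize expected reward minus β·KL(π_θ ‖ π_ref). Below is a minimal PyTorch sketch of such a penalty term; the function name kl_penalty and the coefficient beta are illustrative assumptions, not part of any specific library or the course material.

```python
import torch
import torch.nn.functional as F

def kl_penalty(logits_current, logits_ref, beta=0.1):
    """Per-token KL penalty between the current policy (pi_theta)
    and a frozen reference policy (pi_ref).

    logits_current, logits_ref: tensors of shape (batch, seq_len, vocab).
    beta: penalty coefficient (illustrative value; tuned in practice).
    """
    logp_current = F.log_softmax(logits_current, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)
    # KL(pi_theta || pi_ref), summed over the vocabulary at each position
    kl = (logp_current.exp() * (logp_current - logp_ref)).sum(dim=-1)
    return beta * kl.mean()

# Usage sketch: total_loss = task_loss + kl_penalty(cur_logits, ref_logits)
```

Because π_ref is frozen, the penalty grows only when the trained model's output distribution drifts away from the baseline, which is what keeps updates conservative.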

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reference Policy (π_ref)
Policy Probability Ratio (Ratio Function)
An autonomous agent is being trained to navigate a maze. The agent's decision-making process at any given intersection (a 'state') is determined by a specific component of its programming. Which of the following scenarios best exemplifies this decision-making component?
An autonomous agent is programmed to navigate a grid. When it reaches a specific grid cell (state 'S'), it must choose an action. Consider two different versions of the agent's programming:
- Agent 1: When in state 'S', it is programmed to always choose the action 'move North'.
- Agent 2: When in state 'S', it is programmed to choose 'move North' with 70% probability and 'move East' with 30% probability.
Which statement best analyzes the difference in how these two agents map states to actions?
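For illustration, the two mappings in this question can be sketched as policy functions. This is a minimal Python sketch with illustrative names, assuming Agent 1 implements a deterministic policy (each state maps to a single action) and Agent 2 a stochastic policy (each state maps to a probability distribution over actions):

```python
import random

# Agent 1: deterministic policy -- state maps to exactly one action
def agent1_policy(state):
    if state == "S":
        return "move North"

# Agent 2: stochastic policy -- state maps to a distribution over actions
def agent2_policy(state):
    if state == "S":
        return random.choices(["move North", "move East"],
                              weights=[0.7, 0.3])[0]
```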
An agent's goal is to navigate a simple environment and maximize its total reward. The agent is currently in a state 'S'. From this state, it can take one of two actions: 'Action 1' which consistently leads to a reward of +10, or 'Action 2' which consistently leads to a reward of -5. Consider two possible behavior patterns for the agent when it is in state 'S':
- Behavior A: The agent chooses 'Action 1' with a 100% probability.
- Behavior B: The agent chooses 'Action 1' with a 50% probability and 'Action 2' with a 50% probability.
Which behavior pattern is superior for achieving the agent's goal, and why?
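One way to reason about this question is to compare the expected reward of each behavior pattern. The short sketch below works through that arithmetic; all names and values are taken from the scenario above, and the code structure is illustrative.

```python
# Action rewards from the scenario
rewards = {"Action 1": 10, "Action 2": -5}

# Each behavior pattern as a probability distribution over actions
behavior_a = {"Action 1": 1.0}                    # always Action 1
behavior_b = {"Action 1": 0.5, "Action 2": 0.5}   # 50/50 split

def expected_reward(policy):
    return sum(p * rewards[a] for a, p in policy.items())

print(expected_reward(behavior_a))  # 10.0
print(expected_reward(behavior_b))  # 2.5  (0.5*10 + 0.5*(-5))
```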
Learn After
A team is fine-tuning a large language model (the 'active model') to improve its performance on a specific task. They use the original, pre-trained version of the model as a fixed baseline. During training, a penalty is applied to the active model whenever its output probabilities for generating the next piece of text diverge significantly from the baseline model's probabilities. What is the most likely reason for incorporating this penalty mechanism?
Analysis of Constrained vs. Unconstrained Model Training
Stabilizing Model Fine-Tuning