Learn Before
Approximated Policy Divergence Penalty Formula
In practice, the policy divergence penalty is approximated by summing the differences in log-probabilities for each state-action pair over a trajectory, rather than using the log-probability of the entire trajectory. This simplification leaves out the environment's transition dynamics. The formula is:
Penalty = Σ [log π_θ(a_t|s_t) - log π_θ_ref(a_t|s_t)]
where the sum runs over the time steps t of the trajectory.
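As a rough illustration of the per-step sum described here, the sketch below adds up log π_θ(a_t|s_t) - log π_θ_ref(a_t|s_t) over one trajectory. The function name `approx_divergence_penalty` and the probability values are hypothetical, chosen only for this example; it assumes the per-step action probabilities under both policies are already available.

```python
import math


def approx_divergence_penalty(logp_policy, logp_ref):
    """Sum the per-step log-probability differences over one trajectory.

    logp_policy[t] is log pi_theta(a_t | s_t) and logp_ref[t] is
    log pi_theta_ref(a_t | s_t). Environment transition probabilities
    are not an input to this computation.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("both sequences must cover the same time steps")
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))


# Hypothetical action probabilities for a three-step trajectory.
policy_probs = [0.5, 0.8, 0.5]   # pi_theta(a_t | s_t)
ref_probs = [0.25, 0.8, 0.5]     # pi_theta_ref(a_t | s_t)

logp_policy = [math.log(p) for p in policy_probs]
logp_ref = [math.log(p) for p in ref_probs]

penalty = approx_divergence_penalty(logp_policy, logp_ref)
print(penalty)
```

Only the first step differs between the two policies in this made-up data, so the penalty reduces to log(0.5) - log(0.25).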

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Approximated Policy Divergence Penalty Formula
In reinforcement learning, a penalty is often used to limit how much a new policy deviates from a previous one. The exact penalty considers the probability of an entire sequence of states and actions. A common practical simplification is to calculate this penalty based only on the sum of per-step action log-probabilities, effectively ignoring the environment's state transition probabilities. What is the primary consequence of this simplification?
Choosing a Policy Divergence Penalty
In reinforcement learning, a penalty is often used to prevent a policy from changing too drastically. The exact penalty is based on the probability of an entire sequence of states and actions. A common simplification calculates this penalty by summing the log-probabilities of each action taken, without considering the probabilities of transitioning between states.
Statement: This simplified approach is preferred because it provides a more precise measure of the policy's change by isolating the agent's decision-making process from environmental randomness.
Learn After
An agent executes an identical sequence of states and actions in two different environments, A and B. The agent's policy (π_θ) and a reference policy (π_θ_ref) are also the same in both scenarios. When calculating the approximated policy divergence penalty using the formula
Penalty = Σ [log π_θ(a_t|s_t) - log π_θ_ref(a_t|s_t)], the result is identical for both environments. What is the fundamental reason for this?
Calculating Approximated Policy Divergence
Consider the approximated policy divergence penalty formula: Penalty = Σ [log π_θ(a_t|s_t) - log π_θ_ref(a_t|s_t)]. This penalty's value for a fixed trajectory of states and actions is sensitive to changes in the environment's transition dynamics.