1Cademy - Choosing a Policy Divergence Penalty

Learn Before

Approximation of the Policy Divergence Penalty

Case Study

Choosing a Policy Divergence Penalty

Which method (A or B) would be more computationally expensive, and which would be less precise in measuring the true divergence in this specific environment? Justify your choice for each, explaining the trade-off the team faces.

Updated 2025-10-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Approximated Policy Divergence Penalty Formula
In reinforcement learning, a penalty is often used to limit how much a new policy deviates from a previous one. The exact penalty considers the probability of an entire sequence of states and actions. A common practical simplification is to calculate this penalty from the log-probabilities of the actions alone, summed over the steps of the sequence, effectively ignoring the environment's state transition probabilities. What is the primary consequence of this simplification?
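
The simplification described above can be illustrated with a minimal sketch. It assumes the per-step quantity is the log-probability of the action actually taken, and that the penalty is the difference between the new and previous policies' summed log-probabilities. The function name and the example numbers are hypothetical and serve only as illustration.

```python
import math


def approx_divergence_penalty(new_probs, ref_probs):
    """Approximate penalty: sum over steps of
    log pi_new(a_t | s_t) - log pi_ref(a_t | s_t).

    Each list holds the probability that a policy assigned to the action
    actually taken at each step. State transition probabilities are not
    used anywhere, which is the simplification in question.
    """
    if len(new_probs) != len(ref_probs):
        raise ValueError("both policies must be scored on the same steps")
    return sum(math.log(p) - math.log(q) for p, q in zip(new_probs, ref_probs))


# Hypothetical per-step action probabilities for a three-step sequence.
new_probs = [0.9, 0.5, 0.8]  # new policy
ref_probs = [0.6, 0.5, 0.4]  # previous (reference) policy

penalty = approx_divergence_penalty(new_probs, ref_probs)
print(f"approximate penalty: {penalty:.4f}")
```

The penalty is zero when both policies assign identical probabilities at every step and grows as the new policy puts more mass on the sampled actions than the previous one did. Only the policies' own outputs are needed, so no model of the environment is required.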
Choosing a Policy Divergence Penalty
In reinforcement learning, a penalty is often used to prevent a policy from changing too drastically. The exact penalty is based on the probability of an entire sequence of states and actions. A common simplification calculates this penalty by summing the log-probabilities of each action taken, without considering the probabilities of transitioning between states.

Statement: This simplified approach is preferred because it provides a more precise measure of the policy's change by isolating the