1Cademy - Stabilizing Reinforcement Learning Training

Learn Before

Penalty-Based Trust Region Implementation

Case Study

Stabilizing Reinforcement Learning Training

Given the following case study, propose a modification to the training objective J(θ) to address the observed instability. Explain the role of each component you add and why this modification is expected to lead to more stable learning.

Updated 2025-10-02

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Log-Probability Difference as a Policy Divergence Penalty
An engineer is training a policy model and wants to prevent large, destabilizing updates between training iterations. They modify their original objective function, J(θ), to a new objective function, J_new(θ) = J(θ) - β * D(θ, θ_old), where θ represents the current policy parameters, θ_old represents the parameters from the previous iteration, D is a function that measures the divergence between the two sets of parameters (a larger value means more divergence), and β is a positive co
Stabilizing Reinforcement Learning Training
Choosing an Objective Function for Stable Policy Updates
Stabilizing Policy Updates with a Divergence Penalty
When implementing a penalty-based trust region for policy optimization where the goal is to maximize the objective function, increasing the weight of the penalty term will expand the trusted area, allowing the policy to make larger updates.

Learn Before

Related