1Cademy - Choosing an Objective Function for Stable Policy Updates

Learn Before

Penalty-Based Trust Region Implementation

Short Answer

Choosing an Objective Function for Stable Policy Updates

An AI engineer is training a reinforcement learning agent. They are considering two different objective functions to maximize for updating the agent's policy parameters from θ_old to θ_new:

Objective A: J(θ_new)
Objective B: J(θ_new) - α * D(θ_new, θ_old)

Here, J(θ) is the standard performance measure, D is a function that calculates the difference between the new and old policy parameters (a larger value means a bigger change), and α is a positive constant.

Explain why Objective B is generally preferred for ensuring stable and reliable training progress. In your explanation, describe the role of the term - α * D(θ_new, θ_old).

0

1

Updated 2025-10-06

Contributors are:

Who are from:

Learn Before

Related