Learn Before
An engineer is training a policy model and wants to prevent large, destabilizing updates between training iterations. They modify their original objective function, J(θ), to a new objective function, J_new(θ) = J(θ) - β * D(θ, θ_old), where θ represents the current policy parameters, θ_old represents the parameters from the previous iteration, D is a function that measures the divergence between the two sets of parameters (a larger value means more divergence), and β is a positive coefficient. During optimization, the goal is to maximize J_new(θ). What is the primary effect of the - β * D(θ, θ_old) term on the training process?
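As a rough, self-contained sketch (not part of the original question), assuming NumPy and a mean-squared parameter difference as a stand-in for D, the snippet below shows how the subtracted term grows with the divergence from θ_old and therefore lowers the value being maximized; the names divergence, penalized_objective, and J are hypothetical, introduced only for illustration.

```python
import numpy as np

def divergence(theta, theta_old):
    # Illustrative stand-in for D: mean squared difference between parameter vectors.
    return float(np.mean((theta - theta_old) ** 2))

def penalized_objective(J, theta, theta_old, beta=1.0):
    # J_new(theta) = J(theta) - beta * D(theta, theta_old).
    # The further theta drifts from theta_old, the more is subtracted from the
    # value being maximized, so large jumps away from the previous policy are
    # discouraged.
    return J(theta) - beta * divergence(theta, theta_old)

# Toy objective: watch the subtracted penalty grow with divergence.
J = lambda theta: float(np.sum(theta))
theta_old = np.zeros(3)
for theta in (np.array([0.1, 0.1, 0.1]), np.array([2.0, 2.0, 2.0])):
    raw = J(theta)
    penalized = penalized_objective(J, theta, theta_old, beta=1.0)
    print(f"raw J = {raw:.2f}, penalty = {raw - penalized:.2f}, J_new = {penalized:.2f}")
```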
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Log-Probability Difference as a Policy Divergence Penalty
Stabilizing Reinforcement Learning Training
Choosing an Objective Function for Stable Policy Updates
Stabilizing Policy Updates with a Divergence Penalty
When implementing a penalty-based trust region for policy optimization where the goal is to maximize the objective function, increasing the weight of the penalty term shrinks the trusted area, constraining the policy to smaller updates.
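A small worked example (added here, not from the original card) makes the relationship concrete: with a one-dimensional J(θ) = θ and a quadratic divergence D(θ, θ_old) = (θ - θ_old)², the penalized optimum moves by only 1/(2β), so a larger β yields a smaller step.

```latex
J_{\text{new}}(\theta) = \theta - \beta\,(\theta - \theta_{\text{old}})^2,
\qquad
\frac{dJ_{\text{new}}}{d\theta} = 1 - 2\beta\,(\theta - \theta_{\text{old}}) = 0
\;\;\Rightarrow\;\;
\theta^{\star} = \theta_{\text{old}} + \frac{1}{2\beta}.
```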