Multiple Choice

An engineer is training a policy model and wants to prevent large, destabilizing updates between training iterations. They modify their original objective function, J(θ), to a new objective function, J_new(θ) = J(θ) - β * D(θ, θ_old), where θ represents the current policy parameters, θ_old represents the parameters from the previous iteration, D is a function that measures the divergence between the two sets of parameters (a larger value means more divergence), and β is a positive coefficient. During optimization, the goal is to maximize J_new(θ). What is the primary effect of the - β * D(θ, θ_old) term on the training process?
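To see the effect concretely, here is a minimal numeric sketch. The objective J and the divergence D below are hypothetical stand-ins (a toy quadratic objective maximized at θ = 5, and a squared-distance divergence), chosen only to make the penalty's behavior visible: gradient ascent on J_new(θ) is pulled toward θ_old as β grows.

```python
# Toy objective: J(theta) = -(theta - 5)^2, maximized at theta = 5.
# (Hypothetical choice; stands in for the policy objective J.)
def J_grad(theta):
    return -2.0 * (theta - 5.0)

# Hypothetical divergence: D(theta, theta_old) = (theta - theta_old)^2.
# Larger D means theta has moved further from theta_old.
def D_grad(theta, theta_old):
    return 2.0 * (theta - theta_old)

def update(theta_old, beta, lr=0.01, steps=500):
    """Gradient ascent on J_new(theta) = J(theta) - beta * D(theta, theta_old)."""
    theta = theta_old
    for _ in range(steps):
        theta += lr * (J_grad(theta) - beta * D_grad(theta, theta_old))
    return theta

theta_old = 0.0
print(update(theta_old, beta=0.0))   # ~5.0: moves freely to the optimum of J
print(update(theta_old, beta=10.0))  # ~0.45: penalty keeps theta near theta_old
```

With this quadratic setup the penalized optimum can be checked analytically: setting the gradient of J_new to zero gives θ* = 5 / (1 + β), so β = 0 recovers the unconstrained optimum 5, while β = 10 holds the update at 5/11 ≈ 0.45, close to θ_old = 0. The penalty does not change where J itself peaks; it trades off progress on J against staying near the previous iterate.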

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science