Short Answer

Choosing an Objective Function for Stable Policy Updates

An AI engineer is training a reinforcement learning agent. They are considering two different objective functions to maximize for updating the agent's policy parameters from θ_old to θ_new:

  • Objective A: J(θ_new)
  • Objective B: J(θ_new) - α * D(θ_new, θ_old)

Here, J(θ) is the standard performance measure, D is a function that calculates the difference between the new and old policy parameters (a larger value means a bigger change), and α is a positive constant.

Explain why Objective B is generally preferred for ensuring stable and reliable training progress. In your explanation, describe the role of the term - α * D(θ_new, θ_old).

0

1

Updated 2025-10-06

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science