An engineer is training a reinforcement learning agent using a policy-based method. They observe the following training behavior: the agent's performance steadily improves for several iterations, but then suddenly collapses, becoming significantly worse than before. This pattern of gradual improvement followed by a catastrophic drop in performance repeats. Which of the following statements provides the most likely explanation for this unstable training dynamic?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Penalty-Based Trust Region Implementation
Trust Region Policy Optimization
Stabilizing Policy Updates in Reinforcement Learning
The Trust Region Size Trade-off