Learn Before
Diagnosing Policy Update Instability
An engineer is training a reinforcement learning agent using the policy update rule shown below, where π_θ(a_t|s_t) is the agent's policy and A(s_t, a_t) is the advantage function. Based on the components of this formula, what characteristic of the policy π_θ itself is the most likely cause of the observed instability? Explain your reasoning.
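A small numerical sketch can help explore this question. The update rule itself is not reproduced in this text, so the code below assumes the common policy-gradient form in which the log-probability log π_θ(a_t|s_t) is weighted by the advantage A(s_t, a_t). That form, and the helper names `actor_loss` and `grad_wrt_prob`, are illustrative assumptions rather than the course's own definitions.

```python
import math

# ASSUMPTION: the update rule uses a surrogate loss of the form
#   loss = -log(pi(a|s)) * A(s, a)
# The original equation is not shown in the text, so this is illustrative only.

def actor_loss(prob, advantage):
    """Surrogate loss for one (state, action) sample under the assumed form."""
    return -math.log(prob) * advantage

def grad_wrt_prob(prob, advantage):
    """Derivative of the assumed loss with respect to the action probability."""
    return -advantage / prob

if __name__ == "__main__":
    advantage = 1.0
    # Watch how the loss and its derivative scale as the probability shrinks.
    for prob in (0.5, 0.1, 1e-3, 1e-6):
        loss = actor_loss(prob, advantage)
        grad = grad_wrt_prob(prob, advantage)
        print(f"pi={prob:<8g} loss={loss:10.4f} dloss/dpi={grad:14.1f}")
```

Running the loop with different probabilities and advantage values shows how each factor contributes to the size of the update under this assumed form.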
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An agent is learning a task using a policy update rule defined by the following equation, where π_θ(a_t|s_t) is the policy and A(s_t, a_t) is the advantage of taking action a_t in state s_t: In a specific state s, the agent takes an action a that results in an advantage value A(s, a) = -3.0. Based on this single experience, how will the update rule adjust the policy π_θ?
Diagnosing Policy Update Instability
A2C Actor Loss Function
Role of the Advantage Function in Policy Updates