Learn Before
An agent is learning a task using a policy update rule defined by the following equation, where π_θ(a_t | s_t) is the policy and A(s_t, a_t) is the advantage of taking action a_t in state s_t:
In a specific state s, the agent takes an action a that results in an advantage value A(s, a) = -3.0. Based on this single experience, how will the update rule adjust the policy π_θ?
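The equation for the update rule is not reproduced in the text above. As a rough illustration only, the sketch below assumes the common advantage-weighted policy-gradient step, θ ← θ + α · A(s, a) · ∇θ log π_θ(a | s), applied to a small softmax policy over three actions in a single state. The softmax parameterization, the learning rate `lr`, and the helper names `softmax` and `policy_gradient_step` are hypothetical choices made for this example and do not come from the source.

```python
import math


def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, advantage, lr=0.1):
    """One assumed update: logits += lr * advantage * grad log pi(action).

    For a softmax policy, the gradient of log pi(action) with respect to
    logit k is 1[k == action] - pi(k).
    """
    probs = softmax(logits)
    grad_log_pi = [
        (1.0 if k == action else 0.0) - p for k, p in enumerate(probs)
    ]
    return [x + lr * advantage * g for x, g in zip(logits, grad_log_pi)]


# Hypothetical setup: three actions, uniform initial policy.
logits = [0.0, 0.0, 0.0]
action = 1          # the action the agent took
advantage = -3.0    # the advantage value given in the question

before = softmax(logits)
after = softmax(policy_gradient_step(logits, action, advantage))

print("P(action) before:", before[action])
print("P(action) after: ", after[action])
```

Under these assumptions, a negative advantage makes the step point against the gradient of log π_θ(a | s), so the probability of the chosen action goes down and the remaining probability mass shifts to the other actions. The magnitude of the advantage (3.0) scales how large that shift is.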
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing Policy Update Instability
A2C Actor Loss Function
Role of the Advantage Function in Policy Updates