Multiple Choice

An agent is learning a task using a policy update rule defined by the following equation, where πθ(at|st) is the policy and A(st, at) is the advantage of taking action at in state st:

$$\frac{\partial J(\theta)}{\partial \theta} \approx \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left( \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t) \right)$$

In a specific state s, the agent takes an action a that results in an advantage value A(s, a) = -3.0. Based on this single experience, how will the update rule adjust the policy πθ?
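To see the effect concretely, here is a minimal numerical sketch (hypothetical values, not from the source): a softmax policy over three actions, with one gradient-ascent step on log πθ(a|s)·A(s, a) using the advantage A = -3.0 from the question.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, 0.2, -0.1])  # hypothetical logits for one state
a = 0                                # the action that was taken
A = -3.0                             # advantage from the question

pi = softmax(theta)
# Gradient of log pi(a) with respect to the logits: one-hot(a) - pi
grad_log_pi = np.eye(len(theta))[a] - pi

# Gradient-ascent step on log pi(a) * A (learning rate is an assumption)
lr = 0.1
theta_new = theta + lr * grad_log_pi * A
pi_new = softmax(theta_new)

print(pi[a], pi_new[a])  # probability of action a goes down, since A < 0
```

Because A is negative, the ascent step moves the logits in the direction that lowers log πθ(a|s), so the policy becomes less likely to repeat action a in state s.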

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science