Multiple Choice

An agent is being trained using a policy gradient method. The objective is to maximize the function $U = \sum_{t} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t)$, where $\pi_{\theta}$ is the policy and $A$ is the advantage function, which indicates how much better an action is than the average.

At a specific state $s$, the agent can choose from three actions: $a_1, a_2, a_3$. The calculated advantage values for these actions are:

  • $A(s, a_1) = +2.5$
  • $A(s, a_2) = -1.0$
  • $A(s, a_3) = -1.5$

Assuming the agent performs one optimization step to maximize the objective, how will the policy probabilities $\pi_{\theta}(a \mid s)$ for these actions most likely change?
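To see the effect concretely, here is a minimal sketch of one gradient ascent step on this objective, assuming a softmax policy over per-action logits at state $s$; the uniform initialization, the learning rate, and the one-observation-per-action setup are illustrative assumptions, not part of the question:

```python
import numpy as np

# Minimal sketch: a softmax policy over per-action logits at state s.
# Zero-initialized logits (uniform policy), the learning rate, and the
# "every action observed once" setup are illustrative assumptions.
advantages = np.array([2.5, -1.0, -1.5])   # A(s,a1), A(s,a2), A(s,a3)
logits = np.zeros(3)                        # policy parameters for a1, a2, a3
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())                 # subtract max for numerical stability
    return e / e.sum()

pi = softmax(logits)
print("before:", pi)                        # ~[0.333, 0.333, 0.333]

# For the objective sum_a A(s,a) * log pi(a|s), the softmax identity
# d log pi(a)/d z_k = 1{a=k} - pi(k) gives the summed gradient w.r.t. z_k:
# A(s,k) - pi(k) * sum_a A(s,a).
grad = advantages - pi * advantages.sum()
logits += lr * grad                         # one gradient ascent step

print("after:", softmax(logits))            # pi(a1|s) rises; pi(a2|s), pi(a3|s) fall
```

Running the sketch shows the expected direction of the update: the probability of the positive-advantage action $a_1$ increases, while the probabilities of the negative-advantage actions $a_2$ and $a_3$ decrease.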
