Multiple Choice

An agent is learning a task using a policy update rule defined by the following equation, where πθ(at|st) is the policy and A(st, at) is the advantage of taking action at in state st:

$$\frac{\partial J(\theta)}{\partial \theta} \approx \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \frac{\partial}{\partial \theta} \left( \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t) \right)$$

In a specific state s, the agent takes an action a that results in an advantage value A(s, a) = -3.0. Based on this single experience, how will the update rule adjust the policy πθ?
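To see the effect concretely, here is a minimal numerical sketch (hypothetical values, not from the source): a softmax policy over three actions, with one gradient-ascent step on log πθ(a|s)·A(s, a) using the advantage A = -3.0 from the question.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, 0.2, -0.1])  # hypothetical logits for one state
a = 0                                # the action that was taken
A = -3.0                             # advantage from the question

pi = softmax(theta)
# Gradient of log pi(a) with respect to the logits: one-hot(a) - pi
grad_log_pi = np.eye(len(theta))[a] - pi

# Gradient-ascent step on log pi(a) * A (learning rate is an assumption)
lr = 0.1
theta_new = theta + lr * grad_log_pi * A
pi_new = softmax(theta_new)

print(pi[a], pi_new[a])  # probability of action a goes down, since A < 0
```

Because A is negative, the ascent step moves the logits in the direction that lowers log πθ(a|s), so the policy becomes less likely to repeat action a in state s.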

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science