Multiple Choice

In a method for training a decision-making agent, an update rule is derived. Consider the following intermediate expression used to calculate the gradient for a single trajectory of states, actions, and rewards:

\nabla_\theta J(\theta) \propto \sum_{t=1}^{T} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \left( \left( \sum_{k=1}^{t-1} r_k \right) + \left( \sum_{k=t}^{T} r_k \right) - b \right) \right]

Here, t is a specific timestep within the trajectory of length T, \pi_\theta(a_t|s_t) is the probability of taking action a_t in state s_t, r_k is the reward at timestep k, and b is a constant value. Which statement best analyzes the relationship between the policy term for timestep t, \nabla_\theta \log \pi_\theta(a_t|s_t), and the two components of the reward sum?
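
A worked identity may help frame the analysis (a standard score-function result, offered here as a hint rather than as part of the original question). For any scalar c that does not vary with the action a_t:

\mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot c \right] = c \sum_{a} \pi_\theta(a|s_t) \, \nabla_\theta \log \pi_\theta(a|s_t) = c \sum_{a} \nabla_\theta \pi_\theta(a|s_t) = c \, \nabla_\theta \sum_{a} \pi_\theta(a|s_t) = c \, \nabla_\theta 1 = 0

Since the rewards r_1, \dots, r_{t-1} are realized before a_t is chosen, they are constants with respect to a_t once the trajectory prefix is fixed, so this identity applies to the first sum and to b, but not to the reward-to-go \sum_{k=t}^{T} r_k.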

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science