Reward-to-Go
The reward-to-go, often denoted G_t, represents the cumulative reward from a specific time step t until the end of an episode. It is calculated as:

G_t = r_t + r_{t+1} + ... + r_T = Σ_{t'=t}^{T} r_{t'}

In policy gradient methods, using the reward-to-go to weight an action's log-probability is a key variance-reduction technique. It improves upon using the total trajectory reward by ensuring that an action's update is influenced only by subsequent rewards, which respects causality and provides more accurate credit assignment.
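A minimal sketch of this computation in plain Python (the function name and the example episode are illustrative; a single backward pass computes every G_t in O(T) via G_t = r_t + G_{t+1}):

```python
def rewards_to_go(rewards):
    """Compute G_t = r_t + r_{t+1} + ... + r_T for every time step t."""
    G = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the suffix sum already computed.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G

# Illustrative episode: prints [6.0, 11.0, 9.0, 4.0]
print(rewards_to_go([-5.0, 2.0, 5.0, 4.0]))
```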

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Gradient Estimation from Sampled Trajectories
An agent is being trained using a policy gradient method. The theoretical objective gradient is expressed as an expectation over trajectories τ sampled from the policy π_θ:

∇J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]

In practice, this is estimated from a batch of |D| sampled trajectories using the following formula:

∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)

What key assumption allows for the transition from the theoretical expectation to this practical sample mean estimator?
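For context, the practical formula is an ordinary Monte Carlo (sample-mean) estimate. A minimal sketch, assuming each trajectory in the batch exposes a hypothetical gradient function grad_log_prob(τ) and a scalar return R(τ) (both names are placeholders, not part of the question):

```python
import numpy as np

def estimate_policy_gradient(batch, grad_log_prob, R):
    """Sample-mean estimate of E_{τ~π_θ}[∇_θ log Pr_θ(τ) · R(τ)].

    The average only stands in for the expectation if the trajectories
    in `batch` are drawn i.i.d. from the current policy π_θ.
    """
    return np.mean([grad_log_prob(tau) * R(tau) for tau in batch], axis=0)
```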
Policy Gradient with Baseline
Reward-to-Go
An agent is being trained using a policy gradient method. A batch of data D is collected, containing exactly two trajectories, τ_1 and τ_2.
- Trajectory τ_1 has a total reward R(τ_1) = 10.
- Trajectory τ_2 has a total reward R(τ_2) = -5.
The gradient of the log-probability for each trajectory with respect to the policy parameters θ is denoted as ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2), respectively. Based on the standard practical estimator for the policy gradient, which of the following expressions correctly represents the estimated gradient ∇J(θ) for this batch?
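Substituting these values into the estimator above gives ∇J(θ) ≈ (1/2)[10 · ∇_θ log Pr_θ(τ_1) + (-5) · ∇_θ log Pr_θ(τ_2)]. A numeric sketch with made-up two-dimensional gradient vectors, chosen only to make the arithmetic concrete:

```python
import numpy as np

# Hypothetical gradient vectors, for illustration only.
grad_log_p_tau1 = np.array([0.3, -0.1])
grad_log_p_tau2 = np.array([0.2, 0.4])

# ∇J(θ) ≈ (1/|D|) Σ_{τ∈D} ∇_θ log Pr_θ(τ) R(τ), with |D| = 2.
grad_J = 0.5 * (10 * grad_log_p_tau1 + (-5) * grad_log_p_tau2)
print(grad_J)  # [ 1.  -1.5]
```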
Learn After
An agent completes a task, which consists of a sequence of states, actions, and rewards (s_1, a_1, r_1), (s_2, a_2, r_2), ..., (s_T, a_T, r_T). To improve the agent's performance, we need to adjust the likelihood of taking each action a_t at state s_t. Consider two different ways to calculate the 'quality score' used to update the action a_t:
- Method 1: The score is the sum of all rewards in the sequence: r_1 + r_2 + ... + r_T.
- Method 2: The score is the sum of rewards from time step t onward: r_t + r_{t+1} + ... + r_T.
Which of the following statements best explains why Method 2 is generally a more effective approach for training the agent than Method 1?
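A minimal sketch contrasting the two scoring methods on an illustrative reward sequence; note how Method 2 keeps rewards earned before step t from influencing the score of a_t:

```python
def scores_total_reward(rewards):
    """Method 1: every action gets the same score, r_1 + ... + r_T."""
    total = sum(rewards)
    return [total] * len(rewards)

def scores_reward_to_go(rewards):
    """Method 2: action a_t is scored by r_t + r_{t+1} + ... + r_T only."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

rewards = [1.0, -2.0, 3.0, 0.5]      # illustrative episode
print(scores_total_reward(rewards))  # [2.5, 2.5, 2.5, 2.5]
print(scores_reward_to_go(rewards))  # [2.5, 1.5, 3.5, 0.5]
```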
Analyzing Credit Assignment for a Policy Update
An agent completes an episode of 4 time steps, receiving the following sequence of rewards: r_1 = -10, r_2 = +2, r_3 = +5, r_4 = -1. When updating the agent's decision-making process, what is the 'reward-to-go' value that should be associated with the action taken at time step t = 2?
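A quick numeric check against the definition above (the reward-to-go at t = 2 sums r_2 through r_4):

```python
# Rewards from the episode above, indexed by time step.
r = {1: -10, 2: 2, 3: 5, 4: -1}

# G_2 = r_2 + r_3 + r_4
G_2 = sum(r[t] for t in range(2, 5))
print(G_2)  # 2 + 5 + (-1) = 6
```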