Definition

Reward-to-Go

The reward-to-go, often denoted $G_t$, is the cumulative reward from a specific time step $t$ until the end of an episode at step $T$. It is calculated as:

$$G_t = \sum_{k=t}^{T} r_k$$

In policy gradient methods, weighting an action's log-probability by the reward-to-go, rather than by the total trajectory reward, is a key variance reduction technique. An action taken at step $t$ cannot affect rewards received before $t$, so those earlier rewards add noise to the update without changing its expected value. Dropping them ensures that an action's update is influenced only by subsequent rewards, which respects causality and gives more accurate credit assignment.
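As a rough illustration of the definition, the sketch below computes G_t for every step of a single episode. The function name `reward_to_go` and the example reward values are hypothetical, and the sum is undiscounted to match the formula given here. It relies on the recurrence G_t = r_t + G_{t+1}, which follows directly from the sum, so one backward pass is enough.

```python
from typing import List, Sequence


def reward_to_go(rewards: Sequence[float]) -> List[float]:
    """Return G_t = sum_{k=t}^{T} r_k for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + G_{t+1}.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns


# Hypothetical per-step rewards for a four-step episode.
rewards = [1.0, 0.0, 2.0, 3.0]
rtg = reward_to_go(rewards)

print(rtg)           # [6.0, 5.0, 5.0, 3.0]
print(sum(rewards))  # 6.0, the total trajectory reward
```

With total-reward weighting, every action in this episode would be scaled by 6.0. With reward-to-go, the final action is scaled by 3.0, since the rewards collected before it was taken are excluded.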


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences