Learn Before
In the context of estimating the advantage of taking an action a_t in a state s_t, the formula A(s_t, a_t) = (∑_{k=t}^{T} r_k) - V(s_t) is often used. What is the primary role of the reward-to-go term, ∑_{k=t}^{T} r_k, within this specific estimation?
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Gradient with Reward-to-Go and Baseline
Calculating Advantage from a Trajectory
In a given trajectory, if the calculated advantage A(s_t, a_t) = (∑_{k=t}^{T} r_k) - V(s_t) is negative, it implies that the action a_t taken in state s_t led to a sequence of rewards that was worse than the average expected outcome from that state.
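The estimate above can be sketched in a few lines of Python: compute the reward-to-go ∑_{k=t}^{T} r_k as a suffix sum over the trajectory's rewards, then subtract the value baseline V(s_t) at each step. The reward and value numbers below are illustrative assumptions, not from the source.

```python
def reward_to_go(rewards):
    """Suffix sums: the reward-to-go from each timestep t through the end T."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def advantages(rewards, values):
    """A(s_t, a_t) = (reward-to-go at t) - V(s_t), per timestep."""
    return [g - v for g, v in zip(reward_to_go(rewards), values)]

# One hypothetical trajectory of length T = 3:
rewards = [1.0, 0.0, 2.0]   # r_t observed after each action
values  = [2.5, 1.0, 1.5]   # baseline estimates V(s_t)

print(reward_to_go(rewards))        # [3.0, 2.0, 2.0]
print(advantages(rewards, values))  # [0.5, 1.0, 0.5]
```

A negative entry here would mean the action's observed return fell below the baseline's expectation for that state, matching the statement above; positive entries mean the action did better than expected.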