Learn Before
A reinforcement learning agent, in a state s_t with an estimated value V(s_t) = 50, takes an action. This action yields an immediate reward r = 5 and transitions the agent to a new state s_{t+1} with an estimated value V(s_{t+1}) = 40. Assuming a discount factor γ = 0.9, the agent's learning algorithm uses the quantity r + γV(s_{t+1}) - V(s_t) to update its policy. How should the agent interpret the outcome of this action?
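The quantity in the question can be worked through numerically. A minimal sketch (variable names are illustrative) plugging in the values from the question:

```python
# One-step TD error (advantage estimate): delta = r + gamma * V(s_{t+1}) - V(s_t)
r = 5.0          # immediate reward
gamma = 0.9      # discount factor
v_s = 50.0       # estimated value of the current state s_t
v_s_next = 40.0  # estimated value of the next state s_{t+1}

delta = r + gamma * v_s_next - v_s
print(delta)  # -9.0: the outcome was worse than the current value estimate predicted
```

A negative value (5 + 36 − 50 = −9) means the observed return undershot the agent's expectation, so the update decreases the probability of taking that action in s_t.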
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A reinforcement learning agent, in a state s_t with an estimated value V(s_t) = 50, takes an action. This action yields an immediate reward r = 5 and transitions the agent to a new state s_{t+1} with an estimated value V(s_{t+1}) = 40. Assuming a discount factor γ = 0.9, the agent's learning algorithm uses the quantity r + γV(s_{t+1}) - V(s_t) to update its policy. How should the agent interpret the outcome of this action?
Explaining Accelerated Learning in Reinforcement Learning
Equivalence of Advantage Estimation and Reward Shaping
In reinforcement learning, using the one-step advantage estimate, calculated as
r + γV(s_{t+1}) - V(s_t), to update an agent's policy is mathematically equivalent to training the agent with a reward signal shaped by the potential function Φ(s) = V(s): the potential-based shaped reward r + γΦ(s_{t+1}) - Φ(s_t) matches the advantage estimate term by term.
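Taking the shaping potential Φ to be the value estimate V itself, the two quantities can be compared numerically on the example above. A minimal sketch (function names are illustrative, not from any particular library):

```python
def advantage(r, gamma, v_s, v_s_next):
    """One-step advantage (TD error) estimate: r + gamma * V(s') - V(s)."""
    return r + gamma * v_s_next - v_s

def shaped_reward(r, gamma, phi_s, phi_s_next):
    """Potential-based shaped reward: r + gamma * Phi(s') - Phi(s)."""
    return r + gamma * phi_s_next - phi_s

# With Phi = V, the shaped reward coincides with the advantage estimate.
r, gamma, v_s, v_s_next = 5.0, 0.9, 50.0, 40.0
print(advantage(r, gamma, v_s, v_s_next))      # -9.0
print(shaped_reward(r, gamma, v_s, v_s_next))  # -9.0
```

Because the two expressions are the same function of (r, γ, V(s_t), V(s_{t+1})), they agree for every transition, not just this one.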