Multiple Choice

An agent is being trained using a reinforcement learning algorithm where the value network's loss is based on the mean squared temporal difference (TD) error. For a single transition, the agent moves from state s_t to s_{t+1}, receiving a reward r_t. The value network predicts the value of the current state as V(s_t) and the next state as V(s_{t+1}). Given the following values, calculate the squared TD error, which represents the loss for this single sample before averaging:

  • Reward r_t = 2
  • Discount factor γ = 0.9
  • Predicted value of current state V(s_t) = 5.0
  • Predicted value of next state V(s_{t+1}) = 4.0

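A minimal worked computation, assuming the standard one-step TD error definition δ_t = r_t + γ·V(s_{t+1}) − V(s_t), with δ_t² as the per-sample loss (the variable names below are illustrative):

```python
# Illustrative sketch: squared TD error for a single transition,
# assuming delta = r_t + gamma * V(s_{t+1}) - V(s_t).
r_t = 2.0          # reward r_t
gamma = 0.9        # discount factor
v_current = 5.0    # predicted value of the current state, V(s_t)
v_next = 4.0       # predicted value of the next state, V(s_{t+1})

td_error = r_t + gamma * v_next - v_current   # 2 + 3.6 - 5.0 = 0.6
squared_td_error = td_error ** 2              # 0.6 ** 2 = 0.36
print(squared_td_error)                       # ~0.36 (up to floating-point rounding)
```

Plugging in the given values, the TD error is 0.6, so the squared TD error for this single sample is 0.36.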

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science