Multiple Choice

An agent is being trained using a reinforcement learning algorithm where the value network's loss is based on the mean squared temporal difference (TD) error. For a single transition, the agent moves from state s_t to s_{t+1}, receiving a reward r_t. The value network predicts the value of the current state as V(s_t) and the next state as V(s_{t+1}). Given the following values, calculate the squared TD error, which represents the loss for this single sample before averaging:

  • Reward r_t = 2
  • Discount factor γ = 0.9
  • Predicted value of current state V(s_t) = 5.0
  • Predicted value of next state V(s_{t+1}) = 4.0

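A minimal worked computation, assuming the standard one-step TD error definition δ_t = r_t + γ·V(s_{t+1}) − V(s_t), with δ_t² as the per-sample loss (the variable names below are illustrative):

```python
# Illustrative sketch: squared TD error for a single transition,
# assuming delta = r_t + gamma * V(s_{t+1}) - V(s_t).
r_t = 2.0          # reward r_t
gamma = 0.9        # discount factor
v_current = 5.0    # predicted value of the current state, V(s_t)
v_next = 4.0       # predicted value of the next state, V(s_{t+1})

td_error = r_t + gamma * v_next - v_current   # 2 + 3.6 - 5.0 = 0.6
squared_td_error = td_error ** 2              # 0.6 ** 2 = 0.36
print(squared_td_error)                       # ~0.36 (up to floating-point rounding)
```

Plugging in the given values, the TD error is 0.6, so the squared TD error for this single sample is 0.36.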

Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science