Multiple Choice

An agent is being trained using an algorithm where the value network's performance is measured by the mean squared difference between its predicted value for a state, V(s_t), and a computed target value, r_t + γ * V(s_{t+1}). During a particular training batch, the network consistently produces predictions V(s_t) that are significantly lower than the computed target values. What is the most direct effect on the network's parameters during the subsequent optimization step?

0

1

Updated 2025-10-03

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science