True/False

A value model is being trained to estimate the expected future reward from a given state. Its loss is the squared difference between the model's prediction for the current state and a target value, where the target is the sum of the immediate reward and the discounted predicted value of the next state (the temporal-difference target). During the backpropagation step that updates the model's parameters, gradients are propagated through both the model's prediction for the current state and its prediction for the next state (which is part of the target).

0

1
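The distinction the question hinges on can be sketched numerically. The sketch below uses a hypothetical linear value model V(s) = w·s (the weights, states, and reward values are illustrative, not from the source) and compares the standard semi-gradient TD update, which treats the target as a constant, against the full gradient that also differentiates through the next-state prediction:

```python
import numpy as np

# Hypothetical linear value model V(s) = w @ s; all numbers are illustrative.
w = np.array([0.5, -0.2])
s, s_next = np.array([1.0, 2.0]), np.array([0.0, 1.0])
r, gamma = 1.0, 0.9

v_s = w @ s                  # prediction for the current state
v_next = w @ s_next          # prediction for the next state
target = r + gamma * v_next  # TD target
delta = v_s - target         # TD error

# Semi-gradient (standard TD learning): the target is treated as a
# constant, so only the current-state prediction contributes.
grad_semi = 2 * delta * s

# Full gradient: differentiating through the target as well adds a
# -gamma * s_next term -- this is the setup the question describes.
grad_full = 2 * delta * (s - gamma * s_next)

print(grad_semi)  # gradient used in practice (target detached)
print(grad_full)  # gradient if the target is not detached
```

In deep-RL frameworks the semi-gradient is what `target.detach()` (PyTorch) or `stop_gradient` (JAX/TensorFlow) implements; the two gradients differ whenever the next state is nonzero.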

Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science