Multiple Choice

A value model is trained using a loss function that minimizes the squared difference between its current value prediction, $V(s_t)$, and a target value calculated as the sum of the immediate reward and the discounted value of the next state, $r_t + \gamma V(s_{t+1})$. Why is the squared difference used as the core of this loss function, rather than the absolute difference or another metric?
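For reference, the loss described in the question can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not any particular library's implementation: the function names, the scalar inputs, and the default `gamma=0.99` are hypothetical, and the target is treated as a fixed number when differentiating. The last two functions contrast the derivative of the squared loss with the subgradient of the absolute loss, which is one of the properties the question asks you to reason about.

```python
def td_target(reward, next_value, gamma=0.99):
    """One-step target: immediate reward plus discounted next-state value."""
    return reward + gamma * next_value


def squared_td_loss(value, reward, next_value, gamma=0.99):
    """Squared difference between the prediction and the target."""
    error = value - td_target(reward, next_value, gamma)
    return error ** 2


def squared_td_grad(value, reward, next_value, gamma=0.99):
    """Derivative of the squared loss w.r.t. the prediction (target held fixed).

    The result scales linearly with the size of the error.
    """
    return 2.0 * (value - td_target(reward, next_value, gamma))


def absolute_td_grad(value, reward, next_value, gamma=0.99):
    """Subgradient of the absolute loss w.r.t. the prediction.

    Only the sign of the error survives: -1, 0, or 1.
    """
    error = value - td_target(reward, next_value, gamma)
    return (error > 0) - (error < 0)
```

Calling `squared_td_grad` and `absolute_td_grad` on the same inputs with a small and a large error shows how differently the two losses respond to the magnitude of the mistake.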

Updated 2025-10-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science