Multiple Choice

A reward model is being trained with an objective that maximizes L = -(s - r)^2 for each data point, where s is a fixed target score and r is the model's predicted reward. For a particular data point, the target score s is 5.0. The model currently predicts a reward r of 4.5. How would the value of L for this data point change if the model's prediction were updated to 5.5?

0

1
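A quick sanity check (not part of the original question): the objective can be evaluated directly at both predictions to see how L changes.

```python
s = 5.0  # fixed target score from the question

def L(r):
    # Objective for one data point: negative squared error.
    return -(s - r) ** 2

before = L(4.5)  # -(5.0 - 4.5)^2 = -0.25
after = L(5.5)   # -(5.0 - 5.5)^2 = -0.25
print(before, after, after - before)
```

Because the error is squared, an overshoot of 0.5 and an undershoot of 0.5 yield the same value of L, so the change is 0.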

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science