Case Study

Analyzing Reward Model Parameter Updates

A machine learning engineer is training a model to predict a quality score for individual sentences (segments) in a longer text. The training process is designed to minimize the difference between the model's predicted scores and pre-calculated 'true' scores for each sentence. For a particular sentence, the 'true' score is 0.8, but the model's current prediction is 0.3. Explain the process that will occur during the next training step for this specific sentence. How does the training framework use the 'true' score of 0.8 and the predicted score of 0.3 to adjust the model's internal parameters, and what is the intended effect of this adjustment on the model's future predictions for similar sentences?

0

1

Updated 2025-10-05

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science