Multiple Choice

A reward model is being trained with a pairwise ranking loss. For a given prompt x, the preference dataset contains a pair of responses: a preferred response y_pref and a rejected response y_rej. Initially, the model assigns the scores R(x, y_pref) = 2.0 and R(x, y_rej) = 3.0. Given the objective of the loss function, what is the most likely change to these scores after a single optimization step on this data point?
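To reason about the question, it helps to see the loss in action. A standard pairwise ranking loss is the Bradley-Terry negative log-likelihood, L = -log σ(R(x, y_pref) - R(x, y_rej)): the gradient pushes the preferred score up and the rejected score down. Below is a minimal sketch under that assumption, treating the two scores directly as trainable parameters with an illustrative learning rate (in practice the gradient flows through the reward model's weights):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_ranking_loss(r_pref, r_rej):
    # Bradley-Terry style loss: L = -log(sigmoid(r_pref - r_rej))
    return -math.log(sigmoid(r_pref - r_rej))

def one_sgd_step(r_pref, r_rej, lr=0.5):
    # Gradients of L with respect to the scores:
    #   dL/dr_pref = -(1 - sigmoid(d)),  dL/dr_rej = +(1 - sigmoid(d))
    # where d = r_pref - r_rej, so a descent step raises r_pref
    # and lowers r_rej.
    d = r_pref - r_rej
    g = 1.0 - sigmoid(d)
    return r_pref + lr * g, r_rej - lr * g

# Scores from the question: the rejected response currently scores higher.
r_pref, r_rej = 2.0, 3.0
loss_before = pairwise_ranking_loss(r_pref, r_rej)
new_pref, new_rej = one_sgd_step(r_pref, r_rej)
loss_after = pairwise_ranking_loss(new_pref, new_rej)
```

After the step, R(x, y_pref) increases above 2.0 and R(x, y_rej) decreases below 3.0, shrinking (and eventually flipping) the incorrect margin, and the loss falls accordingly.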


Updated 2025-10-02


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science