A system models preferences by first assigning a numerical reward score to a response and then converting it to a 'worth' value using the formula: worth = exp(reward_score). An engineer improves a response, causing its reward score to increase first from 2.0 to 3.0, and then with a further improvement, from 3.0 to 4.0. How does the increase in the response's 'worth' value during the first improvement compare to the increase during the second improvement?
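A quick numerical check of the two increments, sketched in Python (the `worth = exp(reward_score)` formula is from the card; variable names are illustrative):

```python
import math

# worth = exp(reward_score), per the card's formula
first_increase = math.exp(3.0) - math.exp(2.0)   # improvement from 2.0 to 3.0
second_increase = math.exp(4.0) - math.exp(3.0)  # improvement from 3.0 to 4.0

# Because exp grows multiplicatively, each unit step in the reward
# scales the worth increment by a factor of e:
ratio = second_increase / first_increase
print(ratio)  # ≈ e ≈ 2.718
```

Algebraically, (e⁴ − e³)/(e³ − e²) = e³(e − 1)/e²(e − 1) = e, so the second improvement adds e times as much worth as the first.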
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Policy Proportional to Exponentiated Reward
A system for ranking text responses first assigns a numerical reward score to each response, and then calculates a 'worth' value for each response using the formula: worth = exp(reward score). Consider two scenarios:
Scenario 1: Response A has a reward score of 3.0, and Response B has a reward score of 1.0. Scenario 2: Response C has a reward score of 8.0, and Response D has a reward score of 6.0.
How does the ratio of worths (Worth_A / Worth_B) in Scenario 1 compare to the ratio of worths (Worth_C / Worth_D) in Scenario 2?
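The two scenarios can be compared directly with a short Python sketch (the formula is from the card; names are illustrative):

```python
import math

ratio_1 = math.exp(3.0) / math.exp(1.0)  # Scenario 1: Worth_A / Worth_B
ratio_2 = math.exp(8.0) / math.exp(6.0)  # Scenario 2: Worth_C / Worth_D

# exp turns equal score differences into equal worth ratios:
# exp(a) / exp(b) = exp(a - b), and both score gaps are 2.0.
print(ratio_1, ratio_2)  # both ≈ e^2 ≈ 7.389
```

Because exp(a)/exp(b) depends only on the difference a − b, the two ratios are identical: only the reward gap matters, not the absolute scores.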
A system for modeling human preferences assigns a numerical reward score, r, to a given text response. This score can be positive, negative, or zero. To use these scores in a specific type of ranking probability model, each score r must be converted into a 'worth' value α that is always positive and strictly increases as r increases. A researcher proposes using the function α = r² + 0.1 for this conversion. Which statement correctly analyzes the suitability of this proposed function?
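A minimal check of the proposed conversion α = r² + 0.1 (illustrative Python; the function is from the card above):

```python
# alpha = r**2 + 0.1 is always positive, but it is NOT strictly
# increasing in r: for negative rewards, a higher score yields a
# lower worth, and distinct scores can collapse to the same worth.
def alpha(r):
    return r ** 2 + 0.1

assert alpha(-2.0) > alpha(-1.0)  # -1.0 is the better score, yet its worth is lower
assert alpha(-1.0) == alpha(1.0)  # two different scores map to the same worth
```

This is why exp(r) is the standard choice: it is positive everywhere and strictly increasing for all real r, which r² + 0.1 is not.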