1Cademy - A system for modeling human preferences assigns a numerical reward score, `r`, to a given text response. This score can be positive, negative, or zero. To use these scores in a specific type of ranking probability model, each score `r` must be converted into a worth value `α` that is always positive and strictly increases as `r` increases. A researcher proposes using the function `α = r² + 0.1` for this conversion. Which statement correctly analyzes the suitability of this proposed function?
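
The two requirements in the question (always positive, strictly increasing in `r`) can be checked numerically. The following is a minimal sketch, not part of the original question: the helper names `is_always_positive` and `is_strictly_increasing`, the sample scores, and the use of `exp(r)` as a comparison worth function are all assumptions made for illustration.

```python
import math

def proposed_worth(r: float) -> float:
    """The researcher's proposed conversion: alpha = r^2 + 0.1."""
    return r ** 2 + 0.1

def exp_worth(r: float) -> float:
    """A common alternative, assumed here for comparison: alpha = exp(r)."""
    return math.exp(r)

def is_always_positive(worth, scores) -> bool:
    """True if the worth function is positive on every sampled score."""
    return all(worth(r) > 0 for r in scores)

def is_strictly_increasing(worth, scores) -> bool:
    """True if the worth strictly increases across the sorted sampled scores."""
    values = [worth(r) for r in sorted(scores)]
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical sample covering negative, zero, and positive rewards.
scores = [-2.0, -1.0, 0.0, 1.0, 2.0]

for name, worth in [("r^2 + 0.1", proposed_worth), ("exp(r)", exp_worth)]:
    print(
        name,
        "positive:", is_always_positive(worth, scores),
        "strictly increasing:", is_strictly_increasing(worth, scores),
    )
```

On this sample, `r² + 0.1` passes the positivity check but fails the monotonicity check: `r = -2` and `r = 2` both map to the same worth, and the worth falls as `r` rises from `-2` toward `0`. A sampled check like this can only disprove monotonicity, not prove it; for `exp(r)` the property follows from its derivative being positive everywhere.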

Learn Before

Worth Function in Plackett-Luce for RLHF Reward Modeling

Multiple Choice

Updated 2025-10-04
