1Cademy - A language model is generating a completion for an input `x`. The model has a base probability distribution, `π(y|x)`, for four potential completions (`y`). To steer the model's output, a reward function, `r(x, y)`, is applied to create a new unnormalized score for each completion using the formula: `Score(y) = π(y|x) * exp(r(x, y))`. Given the values below, which completion will have the highest score?

Learn Before

Formula for Re-weighting a Probability Distribution with a Reward Function

Multiple Choice

A language model is generating a completion for an input x. The model has a base probability distribution, π(y|x), for four potential completions (y). To steer the model's output, a reward function, r(x, y), is applied to create a new unnormalized score for each completion using the formula: Score(y) = π(y|x) * exp(r(x, y)). Given the values below, which completion will have the highest score?
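
The scoring rule in the question can be sketched in a few lines of Python. This is a minimal illustration, not part of the original question: the function name `reweighted_scores`, the completion labels, and every probability and reward value below are hypothetical placeholders, chosen only to show how the formula behaves.

```python
import math


def reweighted_scores(base_probs, rewards):
    """Return the unnormalized score pi(y|x) * exp(r(x, y)) for each completion y."""
    return {y: base_probs[y] * math.exp(rewards[y]) for y in base_probs}


# Hypothetical values for illustration only; these are NOT the question's values.
base_probs = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}  # pi(y|x)
rewards = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0}         # r(x, y)

scores = reweighted_scores(base_probs, rewards)

# The completion with the highest unnormalized score.
best = max(scores, key=scores.get)

for y, s in scores.items():
    print(f"{y}: {s:.4f}")
print("highest score:", best)
```

With these placeholder numbers, completion C scores highest, even though A has the largest base probability and D has the largest reward. The exponential makes the reward act as a multiplicative factor on the base probability, so both quantities have to be weighed together. Dividing each score by the sum of all scores would turn them back into a probability distribution, but that step does not change which completion ranks first.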

Updated 2025-10-07
