Learn Before
An AI development team is in a phase of training where their goal is to make a language model's responses more aligned with human preferences. They use an optimization process that aims to minimize a loss function, L, which takes an input prompt x, a set of model-generated responses {y1, y2, ...}, and a component r as inputs. How does this loss function L primarily guide the model's policy towards generating better responses?
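One way to see how such a loss guides the policy is a minimal numeric sketch of a pairwise preference loss of the form -log σ(s1 - s2), where s1 and s2 stand for the policy's scores for the response the reward component r prefers and the one it does not. The function name `preference_loss` and the scalar scores are illustrative assumptions, not the book's exact formulation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(score_preferred, score_rejected):
    # Pairwise loss -log sigma(s1 - s2): it shrinks as the policy
    # assigns a higher score to the reward-preferred response, so
    # minimizing it shifts probability mass toward better responses.
    return -math.log(sigmoid(score_preferred - score_rejected))
```

When the policy scores both responses equally, the loss is log 2; raising the preferred response's score drives it toward zero, which is the sense in which minimizing L steers generation toward responses the reward component ranks higher.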
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
During the policy optimization phase where the objective is to minimize the loss function L(x, {y1, y2}, r), the parameters of the reward model r are updated simultaneously with the language model's policy to better reflect human preferences for the given prompt x.
Diagnosing Undesirable Behavior in Policy Optimization