1Cademy

Learn Before

Formulating the Loss Function for Policy Learning in RLHF

Multiple Choice

A development team is refining a language model to generate more helpful responses. They have a collection of user prompts but lack a corresponding set of 'gold standard' correct answers. However, they do have an automated system that can assign a numerical 'helpfulness' score to any response the model generates for a given prompt. To improve the model, the team needs to define a loss function for this training phase. Which of the following best describes the principle they should use to formula
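As an illustration of the setting described (prompts, no reference answers, only a scalar score per response), one common framing treats the score as a reward and defines the loss as the negative expected reward under the model's current policy. The sketch below is a toy under stated assumptions, not the page's own method: the policy chooses among three fixed candidate responses to a single prompt, and the names `softmax`, `policy_loss`, and the `rewards` values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw policy logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_loss(logits, rewards):
    """Negative expected reward of the policy over candidate responses.

    Minimizing this loss is the same as maximizing the expected score.
    """
    probs = softmax(logits)
    return -sum(p * r for p, r in zip(probs, rewards))

# Hypothetical helpfulness scores for three candidate responses to one prompt.
rewards = [0.2, 0.9, 0.5]

# A policy that is indifferent between the three responses.
uniform_loss = policy_loss([0.0, 0.0, 0.0], rewards)

# A policy that puts more probability on the highest-scoring response.
shifted_loss = policy_loss([0.0, 2.0, 0.0], rewards)
```

In this toy, moving probability mass toward the response with the highest score lowers the loss. With a real language model the expectation cannot be enumerated and is typically estimated from sampled responses instead.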

Updated 2025-10-01
