Policy Learning Loss Function in RLHF
The loss function for the policy learning stage in RLHF is defined as the negative expected utility of the model's outputs. The objective is to find the policy parameters $\theta$ that minimize this loss, which is equivalent to maximizing the expected utility. The formula is:

$$\mathcal{L}(\theta) = -\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\left[ R(x, y) \right]$$

Where:
- $\mathcal{D}$ denotes the input-only dataset.
- $y \sim \pi_{\theta}(\cdot \mid x)$ signifies that the output $y$ is sampled from the probability distribution defined by the language model's policy, $\pi_{\theta}$, given the input $x$.
- $R(x, y)$ is a utility function that scores the quality of the output $y$ for the input $x$.
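To make the objective concrete, here is a minimal, hypothetical Python sketch (not from the course material) that estimates this loss by Monte Carlo sampling. The names `dataset`, `sample_output`, and `utility` are toy stand-ins for the input-only dataset $\mathcal{D}$, the policy $\pi_{\theta}$, and the utility function $R(x, y)$.

```python
# Toy illustration (assumed, not the book's implementation): estimating
# L(theta) = -E_{x ~ D, y ~ pi_theta(.|x)}[R(x, y)] by Monte Carlo sampling.
import random

dataset = ["prompt_a", "prompt_b", "prompt_c"]  # D: inputs only, no gold answers

def sample_output(x, theta):
    """Toy policy pi_theta(.|x): emits 'good' with probability theta."""
    return "good" if random.random() < theta else "bad"

def utility(x, y):
    """Toy utility R(x, y): assigns a higher score to the 'good' output."""
    return 1.0 if y == "good" else 0.0

def estimated_loss(theta, num_samples=1000):
    """Monte Carlo estimate of the negative expected utility."""
    total = 0.0
    for _ in range(num_samples):
        x = random.choice(dataset)   # x ~ D
        y = sample_output(x, theta)  # y ~ pi_theta(.|x)
        total += utility(x, y)       # accumulate R(x, y)
    return -total / num_samples      # negate: loss = -E[R(x, y)]

print(estimated_loss(theta=0.2))  # roughly -0.2
print(estimated_loss(theta=0.9))  # roughly -0.9
```

In this toy setup, the policy that produces higher-utility outputs has the lower (more negative) loss, which is why minimizing the loss is equivalent to maximizing the expected utility.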

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Learning Loss Function in RLHF
A development team is refining a language model to generate more helpful responses. They have a collection of user prompts but lack a corresponding set of 'gold standard' correct answers. However, they do have an automated system that can assign a numerical 'helpfulness' score to any response the model generates for a given prompt. To improve the model, the team needs to define a loss function for this training phase. Which of the following best describes the principle they should use to formulate this loss function?
Role of the Loss Function in Policy Learning
Optimizing a Chatbot for User Engagement
Learn After
An engineer is training a language model where the training objective is to adjust the model's parameters to maximize a utility score for its generated outputs. The loss function is defined as the negative of the expected utility score. During a training run, the engineer observes that the calculated loss value is consistently increasing over several iterations (e.g., moving from -15.0 to -12.5 to -10.0). What is the most direct interpretation of this observation?
Rationale for the Negative Expected Utility Loss Function
Consequences of an Incorrect Loss Function Implementation