Learn Before
In the context of the training objective for a language model policy, consider the following objective, where the goal is to find the optimal policy parameters by minimizing the expected value:
min E[-reward + β * (log_prob_policy - log_prob_reference)]
If the hyperparameter β is set to an extremely large positive value, what is the most likely outcome for the optimized policy?
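The effect of a very large β can be checked numerically. The sketch below uses invented scalar probabilities and rewards (none of these numbers come from the card) to compare two candidate responses: one with high reward but low probability under the reference policy, and one with modest reward that matches the reference. As β grows, the KL-style penalty dominates and the reference-matching response minimizes the objective.

```python
import math

def objective(reward, p_policy, p_ref, beta):
    # Per-sample objective: -reward + beta * (log p_policy - log p_ref).
    return -reward + beta * (math.log(p_policy) - math.log(p_ref))

# Candidate A: high reward, but very unlikely under the reference policy.
# Candidate B: modest reward, identical probability under both policies.
# All numbers are illustrative assumptions.
def best_candidate(beta):
    a = objective(reward=10.0, p_policy=0.9, p_ref=0.01, beta=beta)
    b = objective(reward=2.0, p_policy=0.5, p_ref=0.5, beta=beta)
    return "A" if a < b else "B"

print(best_candidate(0.1))    # small beta: reward dominates -> "A"
print(best_candidate(100.0))  # large beta: penalty dominates -> "B"
```

With a very large β, any deviation from the reference policy is penalized so heavily that the optimized policy effectively collapses onto the reference, ignoring the reward signal.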
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing the Regularization Parameter in Policy Optimization
A language model is being trained to minimize the following objective function:

Objective = E[-reward(x, y) + β * (log π_θ(y|x) - log π_θ_ref(y|x))]

During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_θ_ref, resulting in a very low probability π_θ_ref(y|x). Assuming β is a positive constant, how does this specific generation (x, y) influence the two main components of the objective function for this step?

Rearrangement of the Assumed DPO Objective
Unnormalized Target Distribution in the DPO Objective
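The related scenario above (a creative, high-reward response that the reference policy finds very unlikely) can be made concrete with a small numeric sketch. All values below are invented for illustration; the point is only the sign and relative size of the two components.

```python
import math

# Assumed values for one generation (x, y): high reward, low reference probability.
reward = 50.0    # very high reward(x, y)
p_policy = 0.2   # pi_theta(y|x)
p_ref = 1e-6     # pi_theta_ref(y|x): the reference finds y very unlikely
beta = 1.0

# Component 1: the reward term decreases the objective (encourages y).
reward_term = -reward

# Component 2: the regularization term is large and positive,
# because log(p_policy) - log(p_ref) = log(p_policy / p_ref) >> 0.
penalty_term = beta * (math.log(p_policy) - math.log(p_ref))

print(reward_term, penalty_term)
```

So the high reward pulls the objective down while the divergence from the reference pushes it up; the two components directly oppose each other for this generation.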