Multiple Choice

A language model is being trained to minimize the following objective function:

Objective = E[ -reward(x, y) + β * ( log π_θ(y|x) - log π_ref(y|x) ) ]

During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_ref, so the reference probability π_ref(y|x) is very low. Assuming β is a positive constant, how does this specific generation (x, y) affect the two main components of the objective function at this step: the reward term and the KL-penalty (log-ratio) term?
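To make the trade-off concrete, here is a minimal numeric sketch of the two terms for one sample. The function name and all numbers (the reward, β, and the log-probabilities) are made-up illustrative assumptions, not values given in the question.

```python
def objective_terms(reward, logp_theta, logp_ref, beta):
    """Decompose the per-sample objective into its two components.

    Objective = -reward + beta * (log pi_theta - log pi_ref)
    """
    reward_term = -reward                      # high reward -> pushes the objective down
    kl_term = beta * (logp_theta - logp_ref)   # low pi_ref(y|x) -> pushes the objective up
    return reward_term, kl_term, reward_term + kl_term

# Hypothetical numbers for the scenario in the question:
# a very high reward, but pi_ref(y|x) is tiny (very negative log-prob).
reward = 9.0
logp_theta = -5.0    # log pi_theta(y|x): the current policy likes this response
logp_ref = -40.0     # log pi_ref(y|x): the reference model finds it implausible
beta = 0.1

r_term, kl_term, total = objective_terms(reward, logp_theta, logp_ref, beta)
print(f"reward term: {r_term:+.2f}")   # -9.00 (decreases the objective)
print(f"KL term:     {kl_term:+.2f}")  # +3.50 (increases the objective)
print(f"total:       {total:+.2f}")
```

With these assumed numbers, the high reward pulls the objective down while the large log-ratio pushes it up; which effect dominates depends on β and on how unlikely y is under π_ref.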



Tags

Ch.4 Alignment - Foundations of Large Language Models

Analysis in Bloom's Taxonomy