Multiple Choice

A language model is being trained to minimize the following objective function:

Objective = E[ -reward(x, y) + β * ( log π_θ(y|x) - log π_ref(y|x) ) ]

During one training step, the current policy π_θ generates a response y that is highly creative and receives a very high reward(x, y). However, this response is stylistically very different from the typical outputs of the reference policy π_ref, so the reference probability π_ref(y|x) is very low. Assuming β is a positive constant, how does this specific generation (x, y) affect the two main components of the objective function at this step: the reward term and the KL-penalty (log-ratio) term?
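To make the trade-off concrete, here is a minimal numeric sketch of the two terms for one sample. The function name and all numbers (the reward, β, and the log-probabilities) are made-up illustrative assumptions, not values given in the question.

```python
def objective_terms(reward, logp_theta, logp_ref, beta):
    """Decompose the per-sample objective into its two components.

    Objective = -reward + beta * (log pi_theta - log pi_ref)
    """
    reward_term = -reward                      # high reward -> pushes the objective down
    kl_term = beta * (logp_theta - logp_ref)   # low pi_ref(y|x) -> pushes the objective up
    return reward_term, kl_term, reward_term + kl_term

# Hypothetical numbers for the scenario in the question:
# a very high reward, but pi_ref(y|x) is tiny (very negative log-prob).
reward = 9.0
logp_theta = -5.0    # log pi_theta(y|x): the current policy likes this response
logp_ref = -40.0     # log pi_ref(y|x): the reference model finds it implausible
beta = 0.1

r_term, kl_term, total = objective_terms(reward, logp_theta, logp_ref, beta)
print(f"reward term: {r_term:+.2f}")   # -9.00 (decreases the objective)
print(f"KL term:     {kl_term:+.2f}")  # +3.50 (increases the objective)
print(f"total:       {total:+.2f}")
```

With these assumed numbers, the high reward pulls the objective down while the large log-ratio pushes it up; which effect dominates depends on β and on how unlikely y is under π_ref.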



Tags

Ch.4 Alignment - Foundations of Large Language Models

Analysis in Bloom's Taxonomy