Multiple Choice

An autoregressive language model is generating the two-token response 'Good day' given a prompt. The table below shows the per-token log-probabilities from the current policy being trained ($\text{Pr}_{\theta}$) and a fixed reference policy ($\text{Pr}_{\theta_{\text{ref}}}$). The policy divergence penalty is calculated as the sum, over tokens, of the differences between the log-probabilities of the current and reference policies.

| Token | $\log \text{Pr}_{\theta}(y_t \mid \dots)$ | $\log \text{Pr}_{\theta_{\text{ref}}}(y_t \mid \dots)$ |
| :--- | :---: | :---: |
| 'Good' | -0.8 | -1.5 |
| 'day' | -0.4 | -2.1 |

Based on this data, what can be concluded about the current policy's behavior for this specific generation?
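
For concreteness, the summed per-token computation described above can be sketched in a few lines of Python, using the values from the table (a minimal sketch; the variable names are illustrative, not from the source):

```python
# Per-token log-probabilities from the table (illustrative names).
log_p_theta = {"Good": -0.8, "day": -0.4}  # current policy, log Pr_theta(y_t | ...)
log_p_ref = {"Good": -1.5, "day": -2.1}    # reference policy, log Pr_theta_ref(y_t | ...)

# Divergence penalty: sum over tokens of (log Pr_theta - log Pr_ref).
penalty = sum(log_p_theta[t] - log_p_ref[t] for t in log_p_theta)

print(penalty)  # (-0.8 + 1.5) + (-0.4 + 2.1) = 0.7 + 1.7 = 2.4 (up to float rounding)
```

Both per-token differences are positive, so the summed penalty is 0.7 + 1.7 = 2.4.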




Tags

- Ch.4 Alignment - Foundations of Large Language Models
- Foundations of Large Language Models
- Foundations of Large Language Models Course
- Computing Sciences
- Analysis in Bloom's Taxonomy
- Cognitive Psychology
- Psychology
- Social Science
- Empirical Science
- Science