Multiple Choice

An autoregressive language model is generating the two-token response 'Good day' given a prompt. The table below shows the per-token log-probabilities from the current policy being trained ($\text{Pr}_{\theta}$) and a fixed reference policy ($\text{Pr}_{\theta_{\text{ref}}}$). The policy divergence penalty is calculated as the sum, over tokens, of the differences between the log-probabilities of the current and reference policies.

| Token | $\log \text{Pr}_{\theta}(y_t \mid \dots)$ | $\log \text{Pr}_{\theta_{\text{ref}}}(y_t \mid \dots)$ |
| :--- | :---: | :---: |
| 'Good' | -0.8 | -1.5 |
| 'day' | -0.4 | -2.1 |

Based on this data, what can be concluded about the current policy's behavior for this specific generation?
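
For concreteness, the summed per-token computation described above can be sketched in a few lines of Python, using the values from the table (a minimal sketch; the variable names are illustrative, not from the source):

```python
# Per-token log-probabilities from the table (illustrative names).
log_p_theta = {"Good": -0.8, "day": -0.4}  # current policy, log Pr_theta(y_t | ...)
log_p_ref = {"Good": -1.5, "day": -2.1}    # reference policy, log Pr_theta_ref(y_t | ...)

# Divergence penalty: sum over tokens of (log Pr_theta - log Pr_ref).
penalty = sum(log_p_theta[t] - log_p_ref[t] for t in log_p_theta)

print(penalty)  # (-0.8 + 1.5) + (-0.4 + 2.1) = 0.7 + 1.7 = 2.4 (up to float rounding)
```

Both per-token differences are positive, so the summed penalty is 0.7 + 1.7 = 2.4.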




Tags

- Ch.4 Alignment - Foundations of Large Language Models
- Foundations of Large Language Models
- Foundations of Large Language Models Course
- Computing Sciences
- Analysis in Bloom's Taxonomy
- Cognitive Psychology
- Psychology
- Social Science
- Empirical Science
- Science