Multiple Choice

A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:

  • For token 'innovative':

    • Log-probability log π(y_t|...): -3.0
    • Advantage A(...): +4.0
  • For token 'effective':

    • Log-probability log π(y_t|...): -1.2
    • Advantage A(...): +2.0

In policy gradient methods, the utility function U is the sum over the sequence of per-step terms log π(y_t|...) · A(...). Based on this, which token's selection results in a larger (i.e., less negative) contribution to the total utility U at this specific step t?

'innovative'

'effective'
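The arithmetic behind the answer can be checked directly. A minimal sketch, using only the log-probabilities and advantages given in the question, computes each token's per-step contribution log π(y_t|...) · A(...):

```python
# Per-step contribution to the policy-gradient utility
# U = sum_t log pi(y_t | ...) * A(...), using the values from the question.
tokens = {
    "innovative": {"log_prob": -3.0, "advantage": 4.0},
    "effective":  {"log_prob": -1.2, "advantage": 2.0},
}

# Each token's contribution at step t is log-prob times advantage.
contributions = {tok: v["log_prob"] * v["advantage"] for tok, v in tokens.items()}

print(contributions)
# → {'innovative': -12.0, 'effective': -2.4}
```

Both contributions are negative (a log-probability is always ≤ 0 and both advantages are positive), but -2.4 is larger than -12.0, so selecting 'effective' contributes more to U at this step despite its smaller advantage.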

Updated 2025-10-02


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy
