Multiple Choice

A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:

  • For token 'innovative':

    • Log-probability log π(y_t|...): -3.0
    • Advantage A(...): +4.0
  • For token 'effective':

    • Log-probability log π(y_t|...): -1.2
    • Advantage A(...): +2.0

In policy gradient methods, the utility function U is the sum over the sequence of per-step terms log π(y_t|...) · A(...). Based on this, which token's selection results in a larger (i.e., less negative) contribution to the total utility U at this specific step t?

'innovative'

'effective'
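The arithmetic behind the answer can be checked directly. A minimal sketch, using only the log-probabilities and advantages given in the question, computes each token's per-step contribution log π(y_t|...) · A(...):

```python
# Per-step contribution to the policy-gradient utility
# U = sum_t log pi(y_t | ...) * A(...), using the values from the question.
tokens = {
    "innovative": {"log_prob": -3.0, "advantage": 4.0},
    "effective":  {"log_prob": -1.2, "advantage": 2.0},
}

# Each token's contribution at step t is log-prob times advantage.
contributions = {tok: v["log_prob"] * v["advantage"] for tok, v in tokens.items()}

print(contributions)
# → {'innovative': -12.0, 'effective': -2.4}
```

Both contributions are negative (a log-probability is always ≤ 0 and both advantages are positive), but -2.4 is larger than -12.0, so selecting 'effective' contributes more to U at this step despite its smaller advantage.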

Updated 2025-10-02


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy
