Learn Before
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.
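A minimal numerical check of the sign behavior described in this card (a sketch, assuming PyTorch and a toy three-token vocabulary; all names are illustrative): with the utility U = Σ_t A(x, y_<t, y_t) · log π(y_t | x, y_<t), gradient ascent scales the log-probability gradient by the advantage, so the sign of A determines whether P(y_t) is pushed up or down.

```python
import torch

# Minimal sketch (toy three-token vocabulary, PyTorch assumed): one
# gradient-ascent step on a single term A * log pi of the utility U.
logits = torch.zeros(3, requires_grad=True)
token = 1            # the sampled token y_t
advantage = -4.0     # a negative advantage A(x, y_<t, y_t)

log_pi = torch.log_softmax(logits, dim=-1)[token]
(advantage * log_pi).backward()          # gradient of this step's A * log pi

with torch.no_grad():
    logits += 0.1 * logits.grad          # gradient *ascent* on U

print(torch.softmax(logits, dim=-1)[token])
# Below the initial 1/3: a negative advantage pushes P(y_t) down,
# while a positive advantage would push it up.
```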
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:

For token 'innovative':
- Log-probability log π(y_t | ...): -3.0
- Advantage A(...): +4.0

For token 'effective':
- Log-probability log π(y_t | ...): -1.2
- Advantage A(...): +2.0

Based on the utility function U used in policy gradient methods, which is a sum of log π · A terms over the sequence, which token's selection results in a larger (i.e., less negative) contribution to the total utility U for the entire sequence at this specific step t?
Analyzing Policy Gradient Updates for Text Generation
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.

Basic A2C Formulation for LLMs
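For context on the "Basic A2C Formulation for LLMs" card referenced above: the utility used throughout these cards is typically U(θ) = Σ_t A(x, y_<t, y_t) · log π_θ(y_t | x, y_<t). Below is a minimal sketch of the corresponding training loss, assuming per-token log-probs and advantages have already been computed (tensor names and shapes are hypothetical, PyTorch assumed):

```python
import torch

def a2c_policy_loss(log_probs: torch.Tensor, advantages: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    """U = sum_t A(x, y_<t, y_t) * log pi(y_t | x, y_<t); we *minimize* -U.

    log_probs:  (batch, seq) log pi of the sampled tokens
    advantages: (batch, seq) A(x, y_<t, y_t), e.g. return minus a value baseline
    mask:       (batch, seq) 1 for response tokens, 0 for prompt/padding
    """
    # Advantages act as fixed per-token weights; they are not
    # differentiated through.
    per_token = advantages.detach() * log_probs
    return -(per_token * mask).sum(dim=-1).mean()
```

With this loss, a token carrying a negative advantage receives a gradient that lowers its probability, which is the sign behavior the cards above test.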