Learn Before
Analyzing Policy Gradient Updates for Text Generation
A language model is being fine-tuned with reinforcement learning to generate text with positive sentiment. The advantage function A is derived from a sentiment score. For a given input, the model generates two candidate sequences. Your task is to analyze which sequence provides a more favorable outcome according to the policy gradient utility function. Calculate the total utility U for each sequence and explain which one the objective function would favor and why.
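The utility function referred to here is the policy gradient objective, a sum of log π(y_t | ...) · A(...) terms over the tokens of a sequence. A minimal sketch of that computation, using hypothetical per-token log-probabilities and advantages (the actual values would come from the model and the sentiment scorer):

```python
def sequence_utility(log_probs, advantages):
    """Sum of log pi(y_t | ...) * A(...) over all steps of a sequence."""
    return sum(lp * a for lp, a in zip(log_probs, advantages))

# Hypothetical values for two candidate sequences (illustration only).
u_a = sequence_utility(log_probs=[-0.5, -1.0], advantages=[3.0, 2.0])  # -1.5 + -2.0 = -3.5
u_b = sequence_utility(log_probs=[-2.0, -1.5], advantages=[2.0, 1.0])  # -4.0 + -1.5 = -5.5

# The objective favors the sequence with the larger (less negative) utility.
print(u_a, u_b)  # -3.5 -5.5  -> sequence A is favored
```

With these made-up numbers, sequence A pairs higher-probability tokens with larger advantages, so its total utility is less negative and the objective favors it.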
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:

For token 'innovative':
- Log-probability log π(y_t|...): -3.0
- Advantage A(...): +4.0

For token 'effective':
- Log-probability log π(y_t|...): -1.2
- Advantage A(...): +2.0

Based on the utility function U used in policy gradient methods, which is a sum of log π * A terms over the sequence, which token's selection results in a larger (i.e., less negative) contribution to the total utility U for the entire sequence at this specific step t?
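Using the numbers given above, the per-step contribution log π(y_t|...) * A(...) can be checked directly:

```python
# Per-step contribution to the utility U is log pi(y_t | ...) * A(...).
contrib_innovative = -3.0 * 4.0  # = -12.0
contrib_effective = -1.2 * 2.0   # = -2.4

# 'effective' contributes the larger (less negative) term, so it yields
# a higher total utility U at this step, despite its smaller advantage.
print(contrib_innovative, contrib_effective)
```

Note that the lower log-probability of 'innovative' is multiplied by a larger advantage, yet the product is still far more negative, so the objective favors 'effective' here.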
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.

Basic A2C Formulation for LLMs