Basic A2C Formulation for LLMs
In Reinforcement Learning from Human Feedback (RLHF), we typically lack a human-annotated input-output dataset and instead rely on an input-only dataset, denoted as D = {x_1, ..., x_M}. In this scenario, outputs are generated by sampling from the language model itself. The fundamental Advantage Actor-Critic (A2C) loss function is defined as

Loss(θ) = - Σ_{x ∈ D} E_{y ~ π_θ(·|x)} [U(x, y)]

Here, y ~ π_θ(·|x) indicates that the output sequence y is sampled according to the policy π_θ, and U(x, y) is the utility function, a sum of A(x, y_<t, y_t) * log π_θ(y_t | x, y_<t) terms over the sequence. While this formulation serves as a basis, more sophisticated reinforcement learning algorithms are typically employed in practice.
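A minimal sketch of this loss, assuming per-token log-probabilities and advantage estimates have already been computed; the tensor names and toy numbers below are illustrative, not taken from the book:

```python
import torch

def a2c_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative expected utility, with U(x, y) = sum_t A(x, y_<t, y_t) * log pi(y_t | x, y_<t)."""
    # Advantages are treated as fixed targets; only the policy's log-probabilities carry gradient.
    utility = (advantages.detach() * logprobs).sum(dim=-1)   # sum over the sequence
    return -utility.mean()                                    # average over sampled outputs

# Toy usage with made-up numbers for a single two-token output:
logprobs = torch.tensor([[-2.0, -0.5]], requires_grad=True)  # log pi_theta(y_t | x, y_<t)
advantages = torch.tensor([[1.0, -1.5]])                      # A(x, y_<t, y_t)
loss = a2c_loss(logprobs, advantages)
loss.backward()
print(loss.item())     # 1.25
print(logprobs.grad)   # tensor([[-1.0, 1.5]]): descent raises logprobs with positive advantage,
                       # and lowers those with negative advantage
```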
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Basic A2C Formulation for LLMs
Prevalence of Advanced RL Algorithms in RLHF
During the fine-tuning of a large language model using an Advantage Actor-Critic (A2C) method, the model generates a response to a given prompt. This response is then evaluated to guide the model's learning process. Which of the following statements best describes the distinct roles of the 'actor' and the 'critic' in a single update step?
You are fine-tuning a large language model using a reinforcement learning process that involves both a policy (the language model itself) and a value function (a 'critic'). For a single training instance based on one input prompt, arrange the following events in the correct chronological order.
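For the ordering question above, a schematic sketch of one such training instance in chronological order; the callables and names are placeholders, not the book's or any library's API:

```python
def a2c_single_step(prompt, sample_response, reward_model, critic, update_policy, update_critic):
    """One A2C-style training instance for a single prompt, in chronological order."""
    response = sample_response(prompt)            # 1. actor (policy) samples y ~ pi_theta(.|x)
    reward = reward_model(prompt, response)       # 2. reward model scores the (prompt, response) pair
    baseline = critic(prompt)                     # 3. critic provides the value estimate V(x)
    advantage = reward - baseline                 # 4. advantage A = r - V(x)
    update_policy(prompt, response, advantage)    # 5. advantage-weighted policy-gradient step
    update_critic(prompt, reward)                 # 6. critic regresses toward the observed reward
    return advantage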
Diagnosing Training Instability in LLM Alignment
During a fine-tuning step for a large language model using an Advantage Actor-Critic (A2C) approach, the model generates a response to a prompt. The reward for this response, as determined by a separate reward model, is significantly higher than the critic's baseline value estimate for that prompt. What is the most likely immediate consequence for the language model's parameters during the subsequent policy update?
A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:
- For token 'innovative': log-probability log π(y_t|...): -3.0, advantage A(...): +4.0
- For token 'effective': log-probability log π(y_t|...): -1.2, advantage A(...): +2.0
Based on the utility function U used in policy gradient methods, which is a sum of log π * A terms over the sequence, which token's selection results in a larger (i.e., less negative) contribution to the total utility U for the entire sequence at this specific step t?
Analyzing Policy Gradient Updates for Text Generation
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.
Basic A2C Formulation for LLMs
Learn After
A language model's policy, π_θ, is being updated by minimizing the loss function Loss(θ) = -E_{y ~ π_θ(·|x)}[U(x, y)], where x is a given input, y is an output generated by the model, and U is a utility function that assigns a high score to desirable outputs and a low score to undesirable ones. What is the direct consequence of minimizing this loss function on the model's behavior?
Deconstructing the Reinforcement Learning Loss Function
A machine learning engineer is fine-tuning a large language model using a reinforcement learning approach. They mistakenly define the loss function to be minimized as Loss(θ) = E_{y ~ π_θ(·|x)}[U(x, y)], without the leading negative sign, where U is a utility function that returns high values for desirable outputs and low values for undesirable ones. What is the most likely outcome of this training process?
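A tiny numerical check of what the sign of the loss does to the update direction, using the score-function surrogate U * log π for the expectation; the numbers are illustrative and assume nothing beyond basic autograd:

```python
import torch

logprob = torch.tensor([-1.5], requires_grad=True)  # log pi_theta(y|x) of a sampled, desirable output
utility = torch.tensor([2.0])                        # U(x, y) is high because y is desirable

correct_loss = -(utility * logprob).sum()            # minimize -E[U]
correct_loss.backward()
print(logprob.grad)                                  # tensor([-2.]): a descent step raises log pi(y|x)

logprob.grad = None
mistaken_loss = (utility * logprob).sum()            # minimize +E[U] (the sign mistake)
mistaken_loss.backward()
print(logprob.grad)                                  # tensor([2.]): a descent step lowers log pi(y|x),
                                                     # pushing the model toward low-utility outputs
```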
Prevalence of Advanced RL Algorithms in RLHF