Learn Before
You are fine-tuning a large language model using a reinforcement learning process that involves both a policy (the language model itself) and a value function (a 'critic'). For a single training instance based on one input prompt, arrange the following events in the correct chronological order.
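As a reference for checking an ordering, here is a minimal sketch of one A2C training instance. All function names are illustrative stubs standing in for the real components (LLM sampling, reward model, critic), not a real API:

```python
# Stubs standing in for the real components (illustrative only).
def policy_generate(prompt):
    return prompt + " -> sampled response"   # actor (the LLM) samples a response

def reward_model_score(prompt, response):
    return 1.0                               # fixed toy reward

def critic_value(prompt):
    return 0.4                               # fixed toy baseline estimate

def run_training_instance(prompt):
    """One A2C training instance, with the chronological order made explicit."""
    steps = []
    response = policy_generate(prompt)
    steps.append("1. policy (actor) generates a response")
    reward = reward_model_score(prompt, response)
    steps.append("2. reward model scores the response")
    baseline = critic_value(prompt)
    steps.append("3. critic estimates a baseline value for the prompt")
    advantage = reward - baseline
    steps.append("4. advantage = reward - baseline is computed")
    steps.append("5. policy parameters are updated using the advantage")
    steps.append("6. critic parameters are updated toward the observed reward")
    return steps, advantage

steps, adv = run_training_instance("Explain RLHF.")
```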
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Basic A2C Formulation for LLMs
Prevalence of Advanced RL Algorithms in RLHF
During the fine-tuning of a large language model using an Advantage Actor-Critic (A2C) method, the model generates a response to a given prompt. This response is then evaluated to guide the model's learning process. Which of the following statements best describes the distinct roles of the 'actor' and the 'critic' in a single update step?
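The two roles can be sketched numerically in a few lines of Python. This is a toy sketch with illustrative values; `a2c_step` and its learning rates are hypothetical, not taken from any RLHF library:

```python
# Toy sketch of one A2C update for a single (prompt, response) pair.
# The actor is the policy (the LLM); the critic is a learned value function.

def a2c_step(reward, value_estimate, log_prob, lr_policy=0.1, lr_critic=0.5):
    # Critic's role: supply a baseline so the update uses the *advantage*,
    # not the raw reward.
    advantage = reward - value_estimate
    # Actor's role: shift probability mass on the generated response,
    # scaled by the advantage (positive -> reinforce, negative -> suppress).
    new_log_prob = log_prob + lr_policy * advantage
    # Critic's own update: regress its estimate toward the observed reward.
    new_value = value_estimate + lr_critic * (reward - value_estimate)
    return advantage, new_log_prob, new_value

adv, lp, v = a2c_step(reward=1.0, value_estimate=0.4, log_prob=-2.0)
```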
Diagnosing Training Instability in LLM Alignment
During a fine-tuning step for a large language model using an Advantage Actor-Critic (A2C) approach, the model generates a response to a prompt. The reward for this response, as determined by a separate reward model, is significantly higher than the critic's baseline value estimate for that prompt. What is the most likely immediate consequence for the language model's parameters during the subsequent policy update?
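One way to check the intuition here is a sign test on the advantage. The helper below is a hypothetical illustration, not part of any library:

```python
def policy_update_direction(reward, baseline):
    """Return the direction of the policy-gradient step for one response."""
    advantage = reward - baseline
    if advantage > 0:
        return "increase probability of the generated response"
    if advantage < 0:
        return "decrease probability of the generated response"
    return "no change"

# Reward well above the critic's baseline -> the response is reinforced.
direction = policy_update_direction(reward=0.9, baseline=0.2)
```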