Learn Before
A team is implementing a training pipeline for a large language model using human feedback and a specific reinforcement learning algorithm. The process involves several distinct stages to align the model's outputs with human preferences. Arrange the following key stages of this training pipeline in the correct chronological order.
0
1
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Comprehension in Revised Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing a Flawed Model Alignment Pipeline
A team is implementing a training pipeline for a large language model using human feedback and a specific reinforcement learning algorithm. The process involves several distinct stages to align the model's outputs with human preferences. Arrange the following key stages of this training pipeline in the correct chronological order.
A research team is fine-tuning a language model using a reinforcement learning pipeline guided by human feedback. They notice that while the policy is being updated, the training process is highly unstable. Upon investigation, they find that the value function's predictions of future rewards are consistently inaccurate. Which of the following is the most direct cause for the value function's failure to learn, given the standard implementation of this training process?