Analyzing an RLHF Initialization Error
An engineer is setting up a training pipeline that uses human feedback. They initialize the policy and reference models from a general-purpose pre-trained language model, while initializing the reward and value models from a model that has already been instruction fine-tuned. Identify the fundamental mistake in this setup and explain the reasoning behind the correct approach.
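For concreteness, here is a minimal sketch of the standard initialization plan used in InstructGPT-style RLHF pipelines, where all four models, including the policy and reference, start from the instruction fine-tuned (SFT) model rather than the raw pre-trained one. It assumes the Hugging Face `transformers` library and a hypothetical checkpoint name `org/sft-model`; neither is specified in the card.

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# Hypothetical instruction fine-tuned (SFT) checkpoint.
SFT_CHECKPOINT = "org/sft-model"

# Policy: a trainable copy of the SFT model; RLHF refines its behavior.
policy_model = AutoModelForCausalLM.from_pretrained(SFT_CHECKPOINT)

# Reference: a frozen copy of the same SFT model, used only to compute
# the KL penalty that keeps the policy from drifting too far from it.
reference_model = AutoModelForCausalLM.from_pretrained(SFT_CHECKPOINT)
for param in reference_model.parameters():
    param.requires_grad = False

# Reward model: the SFT backbone with a scalar head (num_labels=1),
# trained on human preference comparisons.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    SFT_CHECKPOINT, num_labels=1
)

# Value model: commonly initialized from the reward model's weights
# (here, the same SFT backbone plus scalar head) and trained during PPO.
value_model = AutoModelForSequenceClassification.from_pretrained(
    SFT_CHECKPOINT, num_labels=1
)
```

Initializing the policy and reference from the raw pre-trained model, as in the card's setup, would skip the supervised fine-tuning stage that RLHF is meant to build on.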
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Data Collection for Reward Modeling in RLHF
A machine learning team is implementing a training process that uses human feedback to align a language model. They have access to two base models: a general-purpose pre-trained language model (Model A) and a version of that model that has been further fine-tuned on a set of instructions (Model B). For the first stage of their process, which of the following initialization plans is correct for the policy, reference, reward, and value models?
Rationale for Freezing the Reference Model in RLHF