Learn Before
A machine learning team is implementing a training process that uses human feedback to align a language model. They have access to two base models: a general-purpose pre-trained language model (Model A) and a version of that model that has been further fine-tuned on a set of instructions (Model B). For the first stage of their process, which of the following initialization plans is correct for the policy, reference, reward, and value models?
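For orientation (not this card's answer key), the commonly cited InstructGPT-style recipe initializes the four RLHF models from the instruction-tuned checkpoint. A minimal Python sketch of that scheme follows; the `LM` class and `init_rlhf_models` function are hypothetical names introduced here for illustration:

```python
import copy

class LM:
    """Toy stand-in for a language model's parameters."""
    def __init__(self, name, params):
        self.name = name
        self.params = dict(params)   # copy so models can diverge during training
        self.frozen = False

def init_rlhf_models(model_b_params):
    """Sketch of a common RLHF initialization (assumption: InstructGPT-style recipe).

    model_b_params: parameters of the instruction-tuned model (Model B).
    """
    # Policy: starts from the instruction-tuned model, so RL fine-tuning
    # begins from a model that already follows instructions.
    policy = LM("policy", model_b_params)

    # Reference: a frozen copy of the same instruction-tuned model,
    # used only to compute the KL penalty against the policy.
    reference = LM("reference", model_b_params)
    reference.frozen = True

    # Reward model: in this recipe, also initialized from the
    # instruction-tuned model, with the LM head replaced by a scalar head.
    reward = LM("reward", model_b_params)

    # Value model: often initialized from the reward model, since both
    # map sequences to scalar estimates.
    value = LM("value", copy.deepcopy(reward.params))
    return policy, reference, reward, value
```

Under this sketch, the policy and reference begin with identical parameters, and only the reference is marked frozen.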
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Data Collection for Reward Modeling in RLHF
Rationale for Freezing the Reference Model in RLHF
Analyzing an RLHF Initialization Error