Learn Before
Choosing an Alignment Method for a Resource-Constrained Project
Given the lab's constraints described in the case study, which of the following two alignment strategies would be more suitable, and why?
- Strategy A: An approach that requires iteratively sampling new responses from the model and updating the model in an online loop.
- Strategy B: An approach that can directly learn from the existing, static dataset of preferences without needing to generate new samples during training.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Choosing an Alignment Method for a Resource-Constrained Project
Which of the following best analyzes the primary reason why Direct Preference Optimization (DPO) is considered more sample-efficient than Proximal Policy Optimization (PPO) for aligning language models?
The primary reason Direct Preference Optimization (DPO) is considered more sample-efficient than Proximal Policy Optimization (PPO) is that DPO optimizes the policy directly on a fixed, offline dataset of preference pairs, whereas PPO must repeatedly sample new responses from the current policy and score them with a reward model throughout training.
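The offline objective that makes this possible can be sketched as a per-pair DPO-style loss: it needs only log-probabilities of an already-collected chosen/rejected pair under the policy and a frozen reference model, with no new sampling. This is a minimal illustrative sketch; the function name, `beta` value, and the numeric log-probabilities below are assumptions, not values from the case study.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).
    All inputs are log-probabilities of full responses; no sampling needed."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen response more than the reference does -> low loss.
loss_good = dpo_pair_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected response -> higher loss.
loss_bad = dpo_pair_loss(-14.0, -10.0, -12.0, -12.0)
```

Because every quantity comes from scoring existing responses, the whole training loop is a single pass over the static preference dataset, which is exactly why Strategy B suits a resource-constrained lab.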