Multiple Choice

An AI development team is choosing between two methods for aligning a language model with human preferences. Method A involves a multi-stage process: first, an explicit reward model is trained on preference data, and then this model is used to guide the language model's policy using reinforcement learning. Method B uses a simpler, single-stage process that directly optimizes the language model's policy on the preference data using a classification-style objective. What is the most significant implication of Method B's direct optimization approach compared to Method A's multi-stage approach?
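Method B's single-stage, classification-style objective matches Direct Preference Optimization (DPO), which folds the reward signal into the policy objective instead of training an explicit reward model. A minimal sketch of the per-pair loss, assuming per-sequence log-probabilities from the policy and a frozen reference model are already available (function and argument names here are illustrative, not from any particular library):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)), where the margin
    measures how much more the policy prefers the chosen response over
    the rejected one, relative to the reference model. No explicit
    reward model or RL loop is required."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

At zero margin the loss equals log 2 (about 0.693); it drops below that when the policy widens the preferred response's log-probability margin relative to the reference model, and rises above it when the policy favors the rejected response.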

Updated 2025-10-10

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
