Multiple Choice

A development team aims to align a large language model with human preferences. Their methodology is as follows:

  1. For each input prompt, generate 16 different responses from the model.
  2. Use a pre-trained 'reward model' to assign a quality score to each of the 16 responses.
  3. Select only the single highest-scoring response for that prompt.
  4. Compile a new dataset consisting of thousands of these prompt-and-best-response pairs.
  5. Fine-tune the original language model on this new dataset using standard supervised learning methods.
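
The data-construction part of this procedure (steps 1 to 4) can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual code: `build_best_of_n_dataset`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer merely stand in for a real language model and reward model. Selecting the top-scoring sample out of N in this way is commonly referred to as best-of-N selection.

```python
import random
from typing import Callable, List, Tuple


def build_best_of_n_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 16,
) -> List[Tuple[str, str]]:
    """Sample n responses per prompt and keep only the highest-scoring one."""
    dataset = []
    for prompt in prompts:
        # Step 1: draw n candidate responses from the model.
        candidates = [generate(prompt) for _ in range(n)]
        # Steps 2-3: score each candidate and keep the best
        # (max returns the first candidate on ties).
        best = max(candidates, key=lambda response: score(prompt, response))
        # Step 4: store the prompt-and-best-response pair.
        dataset.append((prompt, best))
    return dataset


# Hypothetical stand-ins so the sketch runs without a real model.
_rng = random.Random(0)
_WORDS = ["clear", "helpful", "vague", "short", "detailed"]


def toy_generate(prompt: str) -> str:
    """Pretend language model: returns a random string of 1 to 6 words."""
    return " ".join(_rng.choice(_WORDS) for _ in range(_rng.randint(1, 6)))


def toy_score(prompt: str, response: str) -> float:
    """Pretend reward model: arbitrarily rewards two particular words."""
    words = response.split()
    return float(words.count("helpful") + words.count("detailed"))


sft_pairs = build_best_of_n_dataset(
    ["Explain recursion.", "Summarize this memo."], toy_generate, toy_score
)
```

Step 5 is not shown: `sft_pairs` would be passed to whatever supervised fine-tuning routine the team uses. Note that the 15 lower-scoring responses for each prompt are discarded and never reach the training set.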

Which statement most accurately evaluates this team's approach?


Updated 2025-09-28


Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science