1Cademy

Multiple Choice

A development team aims to align a large language model with human preferences. Their methodology is as follows:

1. For each input prompt, generate 16 different responses from the model.
2. Use a pre-trained 'reward model' to assign a quality score to each of the 16 responses.
3. Select only the single highest-scoring response for that prompt.
4. Compile a new dataset consisting of thousands of these prompt-and-best-response pairs.
5. Fine-tune the original language model on this new dataset usi
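
The generate, score, select, and compile steps listed above can be sketched roughly as follows. This is only an illustrative sketch, not the team's actual code: `best_of_n`, `build_dataset`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer stand in for a real language model and reward model. The final fine-tuning step is left out.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical type aliases: a generator maps a prompt to one sampled response,
# a scorer maps (prompt, response) to a scalar quality score.
Generator = Callable[[str], str]
Scorer = Callable[[str, str], float]


def best_of_n(prompt: str, generate: Generator, score: Scorer, n: int = 16) -> str:
    """Sample n responses for one prompt and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))


def build_dataset(
    prompts: List[str], generate: Generator, score: Scorer, n: int = 16
) -> List[Tuple[str, str]]:
    """Compile (prompt, best response) pairs, one pair per prompt."""
    return [(p, best_of_n(p, generate, score, n)) for p in prompts]


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: str) -> str:
    # Pretend the "model" answers with a random number of filler words.
    return prompt + " ->" + " word" * random.randint(1, 20)


def toy_score(prompt: str, response: str) -> float:
    # Pretend the "reward model" simply prefers longer responses.
    return float(len(response))


if __name__ == "__main__":
    pairs = build_dataset(["Explain recursion", "Define entropy"], toy_generate, toy_score)
    for prompt, best in pairs:
        print(prompt, "|", best)
```

In a real pipeline the resulting pairs would then be used as the training set for the fine-tuning step; note that only one of the 16 sampled responses per prompt survives into that dataset.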

Updated 2025-09-28
