Learn Before
Comparing Data Sourcing Strategies
Two teams are fine-tuning a large language model.
- Team Alpha uses a fixed dataset in which each input prompt is paired with a single, pre-written 'gold standard' response authored by a human expert. The model is trained exclusively on these static pairs.
- Team Beta starts with a large collection of input prompts but no pre-written responses. At each step of training, they take a prompt and have the current version of their model generate a response. This newly generated input-output pair is then used for that training step.
Analyze the fundamental difference in how the output portion of the training data is constructed for Team Beta compared to Team Alpha. What is the primary advantage of Team Beta's approach in terms of the model's potential to generate novel responses?
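The contrast between the two strategies can be sketched in code. This is a minimal, hypothetical illustration, not an actual training loop: the `Model` class, its `generate` and `update` methods, and the sample prompts are all invented stand-ins. The point it demonstrates is that Team Alpha's output side is frozen before training, while Team Beta's output side is sampled from the current model and therefore changes as the model's parameters change.

```python
# Toy sketch of the two data-sourcing strategies.
# All names here (Model, prompts, gold_responses) are illustrative
# assumptions, not part of any real training framework.

prompts = ["Explain RLHF.", "Summarize the article."]
gold_responses = ["RLHF is ...", "The article says ..."]


class Model:
    """Stand-in for a language model whose behavior shifts as it trains."""

    def __init__(self):
        self.version = 0

    def generate(self, prompt):
        # The response depends on the model's *current* parameters,
        # so it changes after every update.
        return f"response-v{self.version} to: {prompt}"

    def update(self, prompt, response):
        # Placeholder for a gradient step on the (prompt, response) pair.
        self.version += 1


# Team Alpha: the output side is fixed before training ever starts.
alpha_model = Model()
for prompt, response in zip(prompts, gold_responses):
    alpha_model.update(prompt, response)  # responses never change

# Team Beta: the output side is sampled from the current model at
# each step (on-policy), so the training pairs evolve with the model.
beta_model = Model()
beta_pairs = []
for prompt in prompts:
    response = beta_model.generate(prompt)  # model-generated output
    beta_pairs.append((prompt, response))
    beta_model.update(prompt, response)
```

Note how `beta_pairs` records a different "version" of the model for each step: the second pair is generated by a model that has already been updated once, which is exactly the distinction the question asks you to analyze.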
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Formulating the Loss Function for Policy Learning in RLHF
A team is refining a language model using a method where, for each training step, a prompt is selected and the model itself generates a response. This prompt-response pair is then used as part of the input for that training step's update calculation. Based on this description, what is the most accurate analysis of the function of the model-generated response in this specific training phase?
Policy Learning in RLHF
Comparing Data Sourcing Strategies
Contrasting Data Sourcing Methods in Model Training
Optimal Parameters Formula in RL Fine-Tuning