1Cademy - Diagnosing Flawed AI Training

Learn Before

Dual Learning Tasks of RLHF: Reward and Policy Learning

Case Study

Diagnosing Flawed AI Training

A development team is refining a large language model using a two-stage training process. In the first stage, they train a 'scoring' model on a dataset of human-ranked conversations to predict which responses humans would prefer. In the second stage, they use this scoring model to provide feedback to the main language model, encouraging it to generate responses that receive high scores.

After deployment, the team notices that the model often produces responses that are grammatically perfect and very polite but extremely vague and non-committal, effectively avoiding the user's actual question. During internal testing, they confirm that the scoring model consistently gives these vague responses very high ratings.

Based on this information, which of the two stages is the most likely root cause of the model's unhelpful behavior, and why?
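To make the two stages concrete, here is a minimal toy sketch in Python. It is not the team's actual pipeline: the two hand-crafted features (politeness, specificity), the preference pairs, and the candidate responses are all invented for illustration, and best-of-n selection stands in for the policy-optimization step that a real system would perform with reinforcement learning. The pairs are deliberately constructed so that the ranked data favors polite responses over specific ones, which lets you trace how the scoring model's learned preferences carry through to the final output.

```python
import math

# Each response is reduced to two hypothetical features: (politeness, specificity).
# A real scoring model would read raw text; these numbers are invented.

def score(weights, features):
    """Linear scoring model: higher means 'predicted to be preferred'."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, steps=500, lr=0.1):
    """Stage 1: fit weights on (preferred, rejected) pairs by gradient descent
    on the Bradley-Terry loss -log(sigmoid(score(preferred) - score(rejected)))."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for preferred, rejected in pairs:
            diff = score(weights, preferred) - score(weights, rejected)
            # 1 - sigmoid(diff): how far the model is from ranking the pair correctly
            g = 1.0 - 1.0 / (1.0 + math.exp(-diff))
            weights = [w + lr * g * (p - r)
                       for w, p, r in zip(weights, preferred, rejected)]
    return weights

def pick_response(weights, candidates):
    """Stage 2 stand-in: best-of-n selection by the scoring model
    (assumed here in place of RL-based policy optimization)."""
    return max(candidates, key=lambda name: score(weights, candidates[name]))

# Invented ranking data: annotators preferred polite-but-vague answers
# over blunt-but-specific ones in every pair.
pairs = [
    ((0.9, 0.2), (0.3, 0.8)),
    ((0.8, 0.1), (0.2, 0.9)),
    ((0.9, 0.3), (0.4, 0.7)),
]

candidates = {
    "vague_polite": (0.95, 0.1),
    "direct_answer": (0.5, 0.9),
}

weights = train_reward_model(pairs)
chosen = pick_response(weights, candidates)
print("learned weights (politeness, specificity):", weights)
print("response favored by the scoring model:", chosen)
```

Inspecting `weights` before looking at `chosen` mirrors the team's internal test: it checks what the scoring model has learned to reward independently of how the second stage uses it.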

Updated 2025-10-10
