Analyzing Training Instability from Reward Design
An engineer is training a language model to generate helpful and safe responses. The model receives a reward of +1 for each helpful sentence it produces. However, if any part of its response is flagged as unsafe, the entire response receives a reward of -100. The engineer observes that the training process is very unstable; the model struggles to improve consistently, and its performance fluctuates wildly between training updates. Based on this scenario, analyze the most probable cause of this training instability, specifically relating it to the design of the reward system.
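The instability the question points at is easiest to see numerically. Below is a minimal Python sketch, not part of the original scenario, that simulates per-response returns under this exact scheme; the response-length cap and the 2% unsafe-flag rate are assumed values chosen purely for illustration.

```python
# Minimal sketch (assumed parameters, not the engineer's actual setup):
# simulate returns under the stated reward scheme and measure their variance.
import numpy as np

rng = np.random.default_rng(0)

N_RESPONSES = 100_000   # simulated rollouts
MAX_SENTENCES = 10      # assumed cap on helpful sentences per response
P_UNSAFE = 0.02         # assumed chance a response is flagged unsafe

# +1 per helpful sentence, drawn uniformly for illustration.
helpful = rng.integers(1, MAX_SENTENCES + 1, size=N_RESPONSES)
unsafe = rng.random(N_RESPONSES) < P_UNSAFE  # rare catastrophic flag

# Per the scenario: if any part of the response is flagged unsafe,
# the entire response's reward becomes -100.
rewards = np.where(unsafe, -100.0, helpful.astype(float))

print(f"mean reward:              {rewards.mean():.2f}")
print(f"reward variance:          {rewards.var():.2f}")
print(f"variance without penalty: {helpful.astype(float).var():.2f}")
```

Under these assumed numbers the return variance grows by more than an order of magnitude once the -100 penalty is included, even though flags are rare: a single flagged sample can swing an entire batch's gradient estimate, which is the mechanism behind the fluctuating performance the engineer observes.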
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing Training Instability from Reward Design
An engineer is training a language model for a customer service chatbot. They are deciding between two reward function designs to guide the model's learning process:
- Scheme A: {+1 for politeness, +2 for helpfulness, -100 for rudeness}
- Scheme B: {+5 for politeness, +10 for helpfulness, -15 for rudeness}
Which reward scheme is more likely to lead to a stable training process with lower gradient variance, and what is the most accurate reason?
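A quick way to test the intuition is to score the same simulated behavior under both schemes. The sketch below does that; the event frequencies (how often a turn is polite, helpful, or rude) are assumed values for illustration only.

```python
# Minimal sketch: draw one shared sequence of behavior events and score it
# under each reward mapping. Event probabilities are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

EVENTS = ["polite", "helpful", "rude"]
P_EVENTS = [0.5, 0.45, 0.05]  # assumed behavior frequencies

scheme_a = {"polite": 1, "helpful": 2, "rude": -100}
scheme_b = {"polite": 5, "helpful": 10, "rude": -15}

draws = rng.choice(EVENTS, size=100_000, p=P_EVENTS)

for name, scheme in [("Scheme A", scheme_a), ("Scheme B", scheme_b)]:
    r = np.array([scheme[e] for e in draws], dtype=float)
    print(f"{name}: mean={r.mean():6.2f}  variance={r.var():8.2f}")
```

Because Scheme A's -100 outlier sits so far from its typical rewards, its variance dwarfs Scheme B's even though rude events are rare in this simulation, and it is that spread, not the signs or ordering of the rewards, that drives the variance of the gradient estimates.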
Critiquing a Reward Function for Maze Navigation