Case Study

Analyzing Training Instability from Reward Design

An engineer is training a language model to generate helpful and safe responses. The model receives a reward of +1 for each helpful sentence it produces. However, if any part of its response is flagged as unsafe, the entire response receives a reward of -100. The engineer observes that the training process is very unstable; the model struggles to improve consistently, and its performance fluctuates wildly between training updates. Based on this scenario, analyze the most probable cause of this training instability, specifically relating it to the design of the reward system.
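To see why this reward design is problematic, the scenario can be sketched as a small simulation (a minimal sketch; the sentence count, unsafe-flag probability, and sample size are illustrative assumptions, not values from the scenario). It draws many episode rewards under the stated scheme, +1 per helpful sentence unless any part is flagged unsafe, in which case the whole response receives -100, and measures the resulting reward variance:

```python
import random

def episode_reward(n_sentences=5, p_unsafe=0.05, rng=random):
    # Reward scheme from the scenario: +1 per helpful sentence,
    # but a single unsafe flag anywhere overrides everything with -100.
    # n_sentences and p_unsafe are assumed illustrative values.
    if rng.random() < p_unsafe:
        return -100.0
    return float(n_sentences)

random.seed(0)
rewards = [episode_reward() for _ in range(10_000)]
mean = sum(rewards) / len(rewards)
var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
print(f"mean reward: {mean:+.2f}, reward variance: {var:.1f}")
```

Even with a small unsafe probability, the reward distribution is bimodal (mostly +5, occasionally -100), so its variance is enormous relative to its mean. Policy-gradient updates estimated from such rewards are correspondingly noisy, which is one plausible mechanism behind the instability the engineer observes.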


Updated 2025-09-29


Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science