An engineer is training a language model for a customer service chatbot. They are deciding between two reward function designs to guide the model's learning process:
- Scheme A: {+1 for politeness, +2 for helpfulness, -100 for rudeness}
- Scheme B: {+5 for politeness, +10 for helpfulness, -15 for rudeness}
Which reward scheme is more likely to lead to a stable training process with lower gradient variance, and what is the most accurate reason?
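The intuition can be checked numerically. In REINFORCE-style policy gradients the update magnitude scales with the reward, so the variance of the reward signal is a rough proxy for gradient variance. The sketch below is a hypothetical toy comparison that assumes each outcome (polite, helpful, rude) occurs with equal probability during early training; the uniform distribution is an assumption, not part of the question.

```python
import statistics

# Hypothetical assumption: each outcome is equally likely early in training.
scheme_a = [1, 2, -100]  # politeness, helpfulness, rudeness
scheme_b = [5, 10, -15]

# Population variance of the reward signal, a proxy for gradient variance
# when updates scale linearly with reward.
var_a = statistics.pvariance(scheme_a)
var_b = statistics.pvariance(scheme_b)

print(f"Scheme A reward variance: {var_a:.1f}")  # ~2289.6
print(f"Scheme B reward variance: {var_b:.1f}")  # ~116.7
print(f"ratio: {var_a / var_b:.1f}x")            # ~19.6x
```

Under this toy distribution, Scheme A's outlier penalty of -100 inflates reward variance by roughly 20x relative to Scheme B, whose rewards sit on a comparable scale; that gap is the kind of signal the question is probing.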
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Analyzing Training Instability from Reward Design
Critiquing a Reward Function for Maze Navigation