Multiple Choice

An engineer is training a language model for a customer service chatbot. They are deciding between two reward function designs to guide the model's learning process:

  • Scheme A: {+1 for politeness, +2 for helpfulness, -100 for rudeness}
  • Scheme B: {+5 for politeness, +10 for helpfulness, -15 for rudeness}

Which reward scheme is more likely to lead to a stable training process with lower gradient variance, and what is the most accurate reason?

0

1

Updated 2025-10-05

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science