Learn Before
A development team is using reinforcement learning to train a language model to be a helpful math tutor. To encourage the model to provide detailed, step-by-step solutions, they implement a simple reward rule: the model receives a higher reward for generating longer responses that include more mathematical equations. Which of the following describes the most significant potential flaw in this approach?
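As a rough illustration, the reward rule described above might be sketched as follows. The function name `heuristic_reward`, the two weights, and the convention of counting any line that contains `=` as an equation are assumptions made for this sketch; the question itself does not specify how length or equations are measured.

```python
def heuristic_reward(response: str,
                     length_weight: float = 0.01,
                     equation_weight: float = 1.0) -> float:
    """Hypothetical reward that grows with response length and equation count."""
    # Length is measured in whitespace-separated words (an assumption).
    num_words = len(response.split())
    # An "equation" is any line containing an equals sign (an assumption).
    num_equations = sum(1 for line in response.splitlines() if "=" in line)
    return length_weight * num_words + equation_weight * num_equations


# A short, correct solution to 2x + 3 = 11.
concise = "2x + 3 = 11\n2x = 8\nx = 4"

# The same solution followed by twenty trivial identities such as "0 = 0".
padded = concise + "\n" + "\n".join(f"{i} = {i}" for i in range(20))

print("concise:", heuristic_reward(concise))
print("padded: ", heuristic_reward(padded))
```

Comparing the two scores shows how this rule ranks a response padded with trivial equations against a concise, correct one. Note that nothing in the function checks whether the mathematics is right or useful to a student.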
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Designing a Reward Rule for Code Generation
Analyzing a Heuristic Reward for a Debate LLM