Selecting a Reward Model for a Math Tutoring LLM
Your team is developing an LLM to act as a math tutor. The primary goal is for the LLM to generate solutions to word problems that are not only correct but also demonstrate a clear and valid step-by-step reasoning process for students to learn from. The team is debating how to design the reward model for reinforcement learning:

Approach A: An outcome reward model that scores a solution solely on whether its final answer matches the reference answer.
Approach B: A process reward model that scores each intermediate reasoning step for validity, rewarding sound solution paths rather than just correct end results.

Which of the two approaches would be more effective for achieving the project's primary goal, and why? Justify your choice by explaining the potential pitfalls of the rejected approach.
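To make the contrast concrete, here is a minimal, illustrative sketch of the two reward designs. It is not from the source: the step format, the verifier hook, and the scoring values are all assumptions chosen for readability.

```python
# Hypothetical sketch contrasting the two reward designs under debate.
# A solution is a list of reasoning-step strings plus a final answer;
# the verifier callback stands in for whatever checking logic (exact
# match, a learned step scorer, human labels) a real system would use.

from typing import Callable


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Outcome reward model (ORM): inspects only the end result.

    The reasoning is never examined, so a solution that reaches the
    right answer through flawed or lucky steps earns full reward.
    """
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def process_reward(steps: list[str], step_is_valid: Callable[[str], bool]) -> float:
    """Process reward model (PRM): scores each intermediate step.

    Credit accrues per valid step, so the signal directly favors the
    clear, correct reasoning chains a tutoring model should produce.
    """
    if not steps:
        return 0.0
    return sum(1.0 for step in steps if step_is_valid(step)) / len(steps)


if __name__ == "__main__":
    steps = [
        "Let x be the number of apples, so 3x + 2 = 14.",
        "Subtract 2 from both sides: 3x = 12.",
        "Divide both sides by 3: x = 4.",
    ]
    print(outcome_reward("4", "4"))               # 1.0: only the answer matters
    print(process_reward(steps, lambda s: True))  # 1.0: every step credited (toy verifier)
```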
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Outcome Reward Models
Process Reward Models
Rule-Based Reward Models for Reasoning
A team is training a language model to solve multi-step logic puzzles. Their training system automatically reviews each line of the model's generated reasoning. If a line represents a valid deductive step, it receives a positive score. If a line contains a logical fallacy or contradicts a previous statement, it receives a negative score, and the evaluation stops. The total score for the entire reasoning path is then used to update the model. Which classification best describes this type of feedback mechanism? (A minimal sketch of this scoring loop appears after this list.)
Match each description of a feedback mechanism for training a reasoning model with the most appropriate classification.
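The line-by-line mechanism in the logic-puzzle question above reduces to a short loop: score each line, stop at the first invalid one, and sum. A minimal sketch, assuming a placeholder classify_line verifier; the function and parameter names are illustrative, not from the source.

```python
# Hypothetical sketch of the line-by-line feedback described in the
# logic-puzzle question: each valid deductive step earns +1, the first
# fallacy or contradiction earns -1 and halts evaluation, and the total
# path score is the reward used to update the model.

from typing import Callable, Iterable


def reasoning_path_reward(
    lines: Iterable[str],
    classify_line: Callable[[str], bool],  # True iff the line is a valid step
) -> float:
    total = 0.0
    for line in lines:
        if classify_line(line):
            total += 1.0   # valid deductive step
        else:
            total -= 1.0   # logical fallacy or contradiction
            break          # evaluation stops at the first error
    return total           # path-level score fed back into training
```

Because the checks apply deterministic rules to intermediate steps rather than judging the final answer alone, the loop illustrates step-level rather than outcome-level supervision.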