Outcome Reward Models
An outcome reward model is a type of verifier used in reinforcement learning for LLMs that evaluates only the final answer of a reasoning process. It assesses the correctness or overall quality of the end result, ignoring the intermediate reasoning steps, and provides a reward signal based solely on that final evaluation.
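The contrast with step-level feedback (see Process Reward Models under Related) can be made concrete with a small sketch. The Python snippet below is a minimal, illustrative outcome reward function, not any particular library's API: the "Final answer:" extraction pattern and the exact-match check are assumptions, and in practice the final evaluation might instead run unit tests or a math-aware equivalence check.

import re

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Score a completion using only its final answer; the reasoning is ignored."""
    # Assumed (hypothetical) convention: the model ends its output with a
    # line of the form "Final answer: <value>".
    match = re.search(r"Final answer:\s*(.+)", completion)
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Two completions with very different reasoning but the same final answer
# receive the same reward, because only the outcome is evaluated.
good = "3 boxes * 4 apples each = 12 apples.\nFinal answer: 12"
messy = "Some shaky reasoning here...\nFinal answer: 12"
print(outcome_reward(good, "12"), outcome_reward(messy, "12"))  # 1.0 1.0

This is exactly the property, and the weakness, that distinguishes an outcome reward model from a process reward model: reasoning quality does not affect the reward as long as the final answers agree.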
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Outcome Reward Models
Process Reward Models
Rule-Based Reward Models for Reasoning
A team is training a language model to solve multi-step logic puzzles. Their training system automatically reviews each line of the model's generated reasoning. If a line represents a valid deductive step, it receives a positive score. If a line contains a logical fallacy or contradicts a previous statement, it receives a negative score, and the evaluation stops. The total score for the entire reasoning path is then used to update the model. Which classification best describes this type of feedback mechanism?
Selecting a Reward Model for a Math Tutoring LLM
Match each description of a feedback mechanism for training a reasoning model with the most appropriate classification.
Learn After
A team is training a language model to act as a programming assistant that writes code to solve specific problems. Their training method involves running the code generated by the model. If the code executes without errors and produces the correct output for a set of predefined tests, the model receives a high reward. If the code fails to execute or produces the wrong output, it receives a low reward. The system does not evaluate the elegance, efficiency, or style of the code itself, only the final result of its execution. Which of the following statements best characterizes this evaluation approach?
Analyzing a Reward System's Weakness
Evaluating a Reward System for an AI Tutor