Learn Before
Insufficiency of Outcome-Based Rewards for Complex Reasoning
For tasks that require complex reasoning, reward models that evaluate only the correctness of the final output are insufficient for effective learning. Such outcome-based feedback carries no information about where errors occur within the reasoning process, so it cannot guide the model on how to improve its individual problem-solving steps.
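As a concrete illustration (a minimal sketch, not part of the original card; the step list and the step_is_valid scorer are hypothetical stand-ins for a learned process reward model), the contrast between the two feedback styles can be written as:

def outcome_reward(final_answer: str, reference: str) -> float:
    # Outcome-based: one scalar for the entire solution. A chain with a
    # faulty step that luckily reaches the right answer earns full reward,
    # and a mostly correct chain that slips at the end earns nothing;
    # neither signal says which step went wrong.
    return 1.0 if final_answer == reference else 0.0

def process_reward(steps: list[str], step_is_valid) -> list[float]:
    # Process-based: one score per reasoning step, which localizes the
    # first faulty step and tells the model where to improve.
    return [1.0 if step_is_valid(step) else 0.0 for step in steps]

Under the outcome-based signal, every step of a failed solution is penalized equally, correct steps included, which is precisely the weakness the card describes.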
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Example of an Outcome-Based Reward Model in Mathematics
Insufficiency of Outcome-Based Rewards for Complex Reasoning
A company is training a language model to act as an automated assistant for processing loan applications. The model must follow a specific, legally mandated, multi-step procedure to ensure fairness and compliance (e.g., checking credit history, verifying income, providing specific disclosures). The company decides to train the model using a system that provides a large positive reward only if the final loan decision (approve/deny) is correct based on the applicant's overall profile. What is the most significant weakness of this training strategy? (A sketch of this reward setup follows the list below.)
Evaluating Reward Model Suitability
Reward Model Suitability for a Creative Task
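Returning to the loan-application question above: a minimal sketch, using hypothetical names (REQUIRED_STEPS, performed_steps) that are not part of the card, of why an outcome-only reward cannot detect a skipped, legally mandated step:

REQUIRED_STEPS = ("check_credit_history", "verify_income", "provide_disclosures")

def outcome_only_reward(decision: str, correct_decision: str) -> float:
    # Rewards the final approve/deny call alone: a model that skips
    # every mandated step but guesses the right decision scores 1.0.
    return 1.0 if decision == correct_decision else 0.0

def compliance_aware_reward(decision: str, correct_decision: str,
                            performed_steps: set[str]) -> float:
    # A process-aware alternative: credit for the outcome plus credit
    # for each mandated step the model actually performed.
    outcome = 1.0 if decision == correct_decision else 0.0
    coverage = sum(s in performed_steps for s in REQUIRED_STEPS) / len(REQUIRED_STEPS)
    return 0.5 * outcome + 0.5 * coverage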
Learn After
Learning Analogy: Outcome vs. Process Feedback
A research team is training a language model to act as a programming assistant that writes complex, multi-step code functions. The training method rewards the model only if the final generated code executes without errors and produces the correct output. Despite extensive training, the model frequently generates code that is logically flawed, even when it happens to produce the correct final result for the training examples. Which of the following statements best analyzes the fundamental weakness of this training approach? (An execution-based reward sketch follows at the end of this list.)
Diagnosing Training Flaws in a Math AI
Critique of AI Training Methodologies for Complex Tasks
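For the programming-assistant question above, a minimal sketch of the execution-based outcome reward it describes; the solve entry point and the example test are hypothetical, not from the card:

def execution_reward(code: str, test_input, expected_output) -> float:
    # Reward 1.0 only if the generated code runs and its output matches.
    # Logically flawed code that happens to pass the single test earns
    # the same reward as a correct solution, so the flaw goes unpenalized.
    namespace: dict = {}
    try:
        exec(code, namespace)                     # define the generated function
        result = namespace["solve"](test_input)   # hypothetical entry point
    except Exception:
        return 0.0
    return 1.0 if result == expected_output else 0.0

# Example: the intended function is the identity, but this flawed
# implementation still earns full reward on a test case where x == 0.
buggy = "def solve(x):\n    return x * 2\n"
print(execution_reward(buggy, 0, 0))  # prints 1.0 despite the wrong logic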