Learn Before
Reward Model Suitability for a Creative Task
A developer is training a language model to generate short, engaging, and original marketing slogans. They decide to use a reward system that gives a high score to slogans that human raters find creative and a low score to those they find uninspired. This system does not analyze the intermediate steps the model took to generate the slogan. Explain why this focus on the final output is a particularly effective strategy for this specific training goal.
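The outcome-based setup described above can be sketched in a few lines. This is a minimal illustration, not an implementation from the course: the function name `outcome_reward` and the 1–5 rater scale are assumptions chosen for clarity.

```python
# Minimal sketch of an outcome-only reward for slogan generation.
# The names (outcome_reward, rater_scores) and the 1-5 rating scale
# are illustrative assumptions, not part of the course material.

def outcome_reward(slogan: str, rater_scores: list[int]) -> float:
    """Score a finished slogan from human creativity ratings (1-5).

    Only the final text is judged; the generation path that produced it
    (sampling choices, intermediate drafts, etc.) is never inspected,
    so any strategy that lands on a creative slogan is rewarded equally.
    """
    return sum(rater_scores) / len(rater_scores)

# Two slogans receive rewards based purely on the final text.
creative_reward = outcome_reward("Sip the impossible.", [5, 4, 5])
dull_reward = outcome_reward("Buy our good drink.", [1, 2, 1])
```

Because the reward depends only on rater judgments of the final slogan, the model is free to reach a creative result by any route, which suits a task where there is no single correct procedure.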
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Example of an Outcome-Based Reward Model in Mathematics
Insufficiency of Outcome-Based Rewards for Complex Reasoning
A company is training a language model to act as an automated assistant for processing loan applications. The model must follow a specific, legally mandated, multi-step procedure to ensure fairness and compliance (e.g., checking credit history, verifying income, providing specific disclosures). The company decides to train the model using a system that provides a large positive reward only if the final loan decision (approve/deny) is correct based on the applicant's overall profile. What is the most significant weakness of this training strategy?
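The weakness probed by this question can be made concrete with a short sketch. This is a hypothetical illustration, assuming made-up step names and a binary reward; none of these identifiers come from the course material.

```python
# Hypothetical sketch: an outcome-only reward scores just the final
# loan decision, so it cannot penalize skipped procedural steps.
# All names (REQUIRED_STEPS, outcome_only_reward) are illustrative.

REQUIRED_STEPS = ["check_credit_history", "verify_income", "provide_disclosures"]

def outcome_only_reward(steps_taken: list[str],
                        decision: str,
                        correct_decision: str) -> float:
    # Only the final decision is compared to the label;
    # steps_taken is ignored entirely.
    return 1.0 if decision == correct_decision else 0.0

# A compliant trajectory and a non-compliant shortcut earn identical
# reward, so the model gets no signal to follow the mandated procedure.
compliant = outcome_only_reward(REQUIRED_STEPS, "approve", "approve")
shortcut = outcome_only_reward([], "approve", "approve")
```

Here `compliant` and `shortcut` are both 1.0: the reward is blind to whether the legally required steps were performed, which is precisely the gap a process-based reward would close.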
Evaluating Reward Model Suitability
Reward Model Suitability for a Creative Task