Two language models generate different reasoning paths (Path A and Path B) to solve the same complex problem. A reward model assesses each step, providing the log-probability of that step being 'correct'. The total reward for a path is the sum of these log-probabilities.
- Path A log-probabilities: [-0.4, -0.5, -0.3]
- Path B log-probabilities: [-0.1, -2.5, -0.2]
Based on this reward calculation method, which statement accurately compares the two paths?
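The scoring rule above reduces to a per-step sum. The following minimal Python sketch assumes exactly that and nothing more; the helper name `path_reward` is made up for illustration and is not part of the question:

```python
def path_reward(step_log_probs):
    """Total reward for a path: the sum of its per-step log-probabilities."""
    return sum(step_log_probs)

path_a = [-0.4, -0.5, -0.3]
path_b = [-0.1, -2.5, -0.2]

print("Path A reward:", path_reward(path_a))  # approx. -1.2
print("Path B reward:", path_reward(path_b))  # approx. -2.8
```

Because summing log-probabilities equals taking the log of the product of the per-step 'correct' probabilities, a single highly improbable step (such as a log-probability of -2.5) can drag a path's total reward below that of a path whose steps are all only moderately probable.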
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Application of Log-Probability-Based Reward in RLHF Policy Training
Evaluating Reasoning Path Quality
A language model generates a three-step reasoning path to solve a problem. A reward model evaluates each step and reports the log-probability that it is 'correct':
- Step 1: -0.2
- Step 2: -0.5
- Step 3: -0.1
According to the method that calculates the total reward by summing the log-probabilities of each step, what is the final reward score for this entire path?
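Under the same sum-of-log-probabilities rule, the calculation is a single sum; this short Python check is purely illustrative:

```python
step_log_probs = [-0.2, -0.5, -0.1]   # per-step log-probabilities from the question
total_reward = sum(step_log_probs)    # total reward = sum of per-step log-probabilities
print(total_reward)                   # approx. -0.8 (modulo floating-point rounding)
```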