Evaluating Reasoning Path Quality
A language model generates two different reasoning paths (Path A and Path B) to solve the same problem. A separate reward model evaluates every step of both paths and assigns each step a log-probability of being 'correct'. Based on the data below, which path receives the higher total reward score if the score is calculated by summing the log-probabilities of its steps? Justify your answer by showing the calculation of each path's total score.
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Application of Log-Probability-Based Reward in RLHF Policy Training
Evaluating Reasoning Path Quality
A language model generates a three-step reasoning path to solve a problem. A reward model evaluates each step and provides the following log-probabilities of each step being 'correct':
- Step 1: -0.2
- Step 2: -0.5
- Step 3: -0.1
According to the method that calculates the total reward by summing the log-probabilities of each step, what is the final reward score for this entire path?
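A minimal sketch of the arithmetic for this three-step path, using the log-probabilities listed above. The variable names are illustrative, and reading the exponentiated sum as a joint probability assumes the reward model's step scores can be multiplied as if independent, which the question does not state.

```python
import math

# Step-level log-probabilities from the question (Steps 1 to 3).
step_log_probs = [-0.2, -0.5, -0.1]

# Total reward = sum of the step log-probabilities.
# math.fsum avoids accumulating floating-point rounding error.
total_reward = math.fsum(step_log_probs)

# Summing logs equals taking the log of the product, so exp(total)
# is the product of the per-step probabilities (assumption: steps
# are treated as independent).
implied_joint_prob = math.exp(total_reward)

print(f"total reward: {total_reward:.2f}")                      # -0.80
print(f"implied joint probability: {implied_joint_prob:.3f}")   # 0.449
```

Because every log-probability is at most 0, each additional step can only lower or preserve the total, so the score for this path is -0.8.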
Two language models generate different reasoning paths (Path A and Path B) to solve the same complex problem. A reward model assesses each step, providing the log-probability of that step being 'correct'. The total reward for a path is the sum of these log-probabilities.
- Path A log-probabilities: [-0.4, -0.5, -0.3]
- Path B log-probabilities: [-0.1, -2.5, -0.2]
Based on this reward calculation method, which statement accurately compares the two paths?
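The comparison can be checked with a short sketch that sums each path's log-probabilities and picks the larger (less negative) total. The helper name `path_reward` is hypothetical and introduced only for illustration.

```python
import math

def path_reward(step_log_probs):
    """Total reward for a path: the sum of its step log-probabilities."""
    return math.fsum(step_log_probs)

# Step-level log-probabilities from the question.
path_a = [-0.4, -0.5, -0.3]
path_b = [-0.1, -2.5, -0.2]

reward_a = path_reward(path_a)  # about -1.2
reward_b = path_reward(path_b)  # about -2.8

# A higher (closer to zero) total means a better-scored path.
better = "Path A" if reward_a > reward_b else "Path B"

print(f"Path A: {reward_a:.2f}, Path B: {reward_b:.2f}, higher reward: {better}")
```

Path A scores about -1.2 and Path B about -2.8, so Path A receives the higher total reward. Path B has the two best individual steps (-0.1 and -0.2), but its single weak step (-2.5) dominates the sum.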