Log-Probability-Based Reward for Reasoning Paths
An alternative way to score a reasoning path uses the log-probabilities produced by the step-level classification model. Instead of counting how many steps are judged 'correct', this approach sums the log-probability of 'correct' over all steps to form the reward for the entire path. The resulting score reflects how confident the classifier is in each step, not just whether each step passes.
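The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `log_prob_reward` and the input format (a plain list holding the classifier's 'correct' probability for each step) are assumptions made for the example.

```python
import math


def log_prob_reward(step_correct_probs):
    """Return the path reward as the sum of log P('correct') over all steps.

    `step_correct_probs` is a hypothetical input format: one probability per
    reasoning step, as output by a step-level classifier.
    """
    total = 0.0
    for p in step_correct_probs:
        # log(0) is undefined, so reject probabilities outside (0, 1].
        if not 0.0 < p <= 1.0:
            raise ValueError(f"probability must be in (0, 1], got {p}")
        total += math.log(p)
    return total


# A path whose steps are all judged certain gets the maximum reward of 0.
print(log_prob_reward([1.0, 1.0, 1.0]))
# Any less confident step pulls the reward below 0.
print(log_prob_reward([0.9, 0.9, 0.5]))
```

Because each term is the logarithm of a probability, every step contributes a value at or below zero, and a single low-confidence step lowers the total noticeably.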

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Scoring Reasoning Paths by Counting Correct Steps
Practical Benefits of Detailed Supervision for Long Reasoning Paths
An AI team is building a supervisory model to assess each step in a multi-step reasoning process. The model receives the initial problem and all preceding steps as input, and it must output a judgment on whether the current step is 'correct' or 'incorrect'. Given this objective, which architectural component is most appropriate for the model's final layer, and why?
Designing a Reward Model for a Cooking Assistant
When a process-based reward model is framed as a classification task, its primary function is to output a single, continuous score (e.g., from 0.0 to 1.0) that represents the quality of a given reasoning step.
Learn After
Formula for Log-Probability-Based Reward for Reasoning Paths
Consider two methods for scoring a multi-step reasoning process generated by an AI. Both methods use an underlying model that, for each step, outputs a probability that the step is 'correct'.
- Method A: Assigns a score of +1 to each step where the probability of being 'correct' is greater than 0.5. The total score is the sum of these step scores.
- Method B: Calculates the total score by summing the logarithm of the 'correct' probability for every step in the process.
Now, analyze two reasoning paths for the same problem:
- Path 1: Consists of 3 steps, each with a 'correct' probability of 0.9.
- Path 2: Consists of 3 steps, each with a 'correct' probability of 0.6.
Which statement accurately compares how these two methods would score the paths?
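The comparison in this question can be worked through numerically. The sketch below implements Method A and Method B as stated above and applies them to Path 1 and Path 2; the function names are illustrative, and the natural logarithm is assumed for Method B since the question does not specify a base.

```python
import math


def count_correct_score(probs, threshold=0.5):
    """Method A: +1 for every step whose 'correct' probability exceeds the threshold."""
    return sum(1 for p in probs if p > threshold)


def log_prob_score(probs):
    """Method B: sum of log P('correct') over every step (natural log assumed)."""
    return sum(math.log(p) for p in probs)


path_1 = [0.9, 0.9, 0.9]
path_2 = [0.6, 0.6, 0.6]

for name, probs in (("Path 1", path_1), ("Path 2", path_2)):
    print(name, count_correct_score(probs), round(log_prob_score(probs), 3))
```

Under these assumptions, Method A gives both paths the same score of 3, because every step clears the 0.5 threshold. Method B gives Path 1 about -0.316 and Path 2 about -1.532, so it ranks Path 1 higher and captures the difference in classifier confidence that Method A hides.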
Evaluating AI Reasoning Strategies
Nuanced Evaluation of Reasoning Paths