1Cademy - Evaluating Reasoning Path Quality

Learn Before

Formula for Log-Probability-Based Reward for Reasoning Paths

Case Study

Evaluating Reasoning Path Quality

A language model generates two different reasoning paths (Path A and Path B) to solve the same problem. A separate reward model evaluates each step of both paths and assigns it a log-probability of being correct. Based on the data below, which path receives the higher total reward score if the score is the sum of the per-step log-probabilities? Justify your answer by showing the calculation of each path's total score.
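The scoring rule described in the question can be sketched in a few lines of Python. This is only an illustration of the summation, not the case study's solution: the function name `total_reward` and the per-step values in `path_a` and `path_b` are hypothetical placeholders, since the actual step data is not reproduced here.

```python
import math


def total_reward(step_log_probs):
    """Return the path-level score: the sum of per-step log-probabilities."""
    return math.fsum(step_log_probs)


# Hypothetical placeholder values, NOT the case study's data.
# Each entry is the log-probability that one reasoning step is correct.
path_a = [-0.1, -0.2, -0.3]
path_b = [-0.05, -0.9]

score_a = total_reward(path_a)
score_b = total_reward(path_b)

# Log-probabilities are at most 0, so the higher score is the one closer to zero.
better = "Path A" if score_a > score_b else "Path B"
print(f"Path A: {score_a:.2f}, Path B: {score_b:.2f}, higher: {better}")
```

Because a sum of logarithms equals the logarithm of a product, the path with the higher total is also the path whose steps have the higher joint probability of being correct, assuming the steps are scored independently. Note that a single very unlikely step can outweigh several confident ones, and that a longer path accumulates more non-positive terms.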

Updated 2025-09-26
