1Cademy

Path A log-probabilities: [-0.4, -0.5, -0.3]
Path B log-probabilities: [-0.1, -2.5, -0.2]

Learn Before

Formula for Log-Probability-Based Reward for Reasoning Paths

Multiple Choice

A language model generates a three-step reasoning path to solve a problem. A reward model evaluates each step and provides the following log-probabilities of each step being 'correct':

Step 1: -0.2
Step 2: -0.5
Step 3: -0.1

According to the method that calculates the total reward by summing the log-probabilities of each step, what is the final reward score for this entire path?
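
The summation the question describes can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from the source: the function name `path_reward` is hypothetical, and the reading of each log-probability as the log of an independent per-step probability of correctness is an assumption.

```python
import math


def path_reward(step_logprobs):
    """Sum per-step log-probabilities into one path-level reward.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # math.fsum keeps floating-point rounding error out of the total.
    return math.fsum(step_logprobs)


# Log-probabilities given in the question for steps 1 to 3.
steps = [-0.2, -0.5, -0.1]
reward = path_reward(steps)
print(f"path reward: {reward:.1f}")  # prints "path reward: -0.8"

# Summing logs equals taking the log of the product, so exponentiating
# recovers the product of the per-step probabilities (assuming the
# scores are natural-log probabilities).
joint = math.exp(reward)
print(f"implied joint probability: {joint:.3f}")  # prints "implied joint probability: 0.449"
```

Applied to the two paths listed near the top of the page, the same function gives -1.2 for Path A and -2.8 for Path B, so a single low-scoring step (-2.5) is enough to pull Path B's total well below Path A's.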

Updated 2025-10-03
