Application of Log-Probability-Based Reward in RLHF Policy Training
The cumulative reward score, computed by summing the log-probabilities that each step in a reasoning path is correct, serves as the reward signal for the standard policy-training phase of Reinforcement Learning from Human Feedback (RLHF). Because a sum of logs equals the log of a product, this total is the log of the joint probability that every step is correct, so a single weak step can sharply lower the reward of an otherwise strong path.
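Since the mechanism is just a sum of per-step log-probabilities, a minimal sketch in Python may help. The function name and example values are illustrative assumptions, not taken from the source:

```python
import math

def path_reward(step_logprobs):
    """Total reward for a reasoning path: the sum of the per-step
    log-probabilities of correctness (equivalently, the log of the
    product of the per-step correctness probabilities)."""
    return sum(step_logprobs)

# Illustrative two-step path (made-up values for this sketch).
logprobs = [-0.3, -0.7]
reward = path_reward(logprobs)
print(f"total reward = {reward:.1f}")                         # total reward = -1.0
print(f"implied joint probability = {math.exp(reward):.3f}")  # ~0.368
```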
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Evaluating Reasoning Path Quality
A language model generates a three-step reasoning path to solve a problem. A reward model evaluates each step and reports the log-probability that it is 'correct':
- Step 1: -0.2
- Step 2: -0.5
- Step 3: -0.1
According to the method that calculates the total reward by summing the log-probabilities of each step, what is the final reward score for this entire path?
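For reference, the arithmetic follows directly from the stated rule and can be checked in a couple of lines (a sketch using only the question's own numbers):

```python
step_logprobs = [-0.2, -0.5, -0.1]   # per-step log-probabilities of correctness
print(f"{sum(step_logprobs):.1f}")   # -0.8
```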
Two language models generate different reasoning paths (Path A and Path B) to solve the same complex problem. A reward model assesses each step, providing the log-probability of that step being 'correct'. The total reward for a path is the sum of these log-probabilities.
- Path A log-probabilities: [-0.4, -0.5, -0.3]
- Path B log-probabilities: [-0.1, -2.5, -0.2]
Based on this reward calculation method, which statement accurately compares the two paths?
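The answer options are not reproduced in this export, but the two totals can be computed directly (a sketch using the question's numbers):

```python
path_a = [-0.4, -0.5, -0.3]
path_b = [-0.1, -2.5, -0.2]
print(f"Path A: {sum(path_a):.1f}")  # Path A: -1.2
print(f"Path B: {sum(path_b):.1f}")  # Path B: -2.8
```

Path A scores higher even though Path B contains two individually stronger steps: under a summed log-probability reward, the single weak step at -2.5 dominates, because the sum corresponds to multiplying the per-step probabilities.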
Learn After
An AI model is being trained to generate step-by-step reasoning. The training process provides a reward for each complete reasoning path, calculated by summing the log-probabilities of each individual step being deemed 'correct'. A higher (i.e., less negative) total reward is better. Consider the following four reasoning paths generated by the model, along with the log-probability of correctness for each step. Which path will be most strongly reinforced during the training process?
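The four candidate paths are not reproduced in this export, so that data must stay missing; as a sketch, selection under this reward reduces to an argmax over summed log-probabilities. The path names and values below are hypothetical placeholders, not the original options:

```python
# Hypothetical candidate paths: placeholder values, NOT the original options.
paths = {
    "Path 1": [-0.3, -0.3, -0.3],
    "Path 2": [-0.1, -1.5, -0.1],
    "Path 3": [-0.2, -0.2, -0.2],
    "Path 4": [-0.2, -0.4, -0.6],
}
totals = {name: sum(lps) for name, lps in paths.items()}
best = max(totals, key=totals.get)   # the least negative total is reinforced most
for name, total in totals.items():
    print(f"{name}: {total:.1f}")
print("most strongly reinforced:", best)  # Path 3 (-0.6)
```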
Evaluating AI Reasoning Paths
Critique of Log-Probability Reward Signals