Application of Log-Probability-Based Reward in RLHF Policy Training
The cumulative reward score, computed by summing the log-probabilities that each step in a reasoning path is correct, serves as the reward signal for the standard policy-training phase of Reinforcement Learning from Human Feedback (RLHF). Because a sum of logs equals the log of a product, this total is the log of the joint probability that every step is correct, so a single weak step can sharply lower the reward of an otherwise strong path.
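Since the mechanism is just a sum of per-step log-probabilities, a minimal sketch in Python may help. The function name and example values are illustrative assumptions, not taken from the source:

```python
import math

def path_reward(step_logprobs):
    """Total reward for a reasoning path: the sum of the per-step
    log-probabilities of correctness (equivalently, the log of the
    product of the per-step correctness probabilities)."""
    return sum(step_logprobs)

# Illustrative two-step path (made-up values for this sketch).
logprobs = [-0.3, -0.7]
reward = path_reward(logprobs)
print(f"total reward = {reward:.1f}")                         # total reward = -1.0
print(f"implied joint probability = {math.exp(reward):.3f}")  # ~0.368
```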
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Evaluating Reasoning Path Quality
A language model generates a three-step reasoning path to solve a problem. A reward model evaluates each step and reports the log-probability that it is 'correct':
- Step 1: -0.2
- Step 2: -0.5
- Step 3: -0.1
According to the method that calculates the total reward by summing the log-probabilities of each step, what is the final reward score for this entire path?
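For reference, the arithmetic follows directly from the stated rule and can be checked in a couple of lines (a sketch using only the question's own numbers):

```python
step_logprobs = [-0.2, -0.5, -0.1]   # per-step log-probabilities of correctness
print(f"{sum(step_logprobs):.1f}")   # -0.8
```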
Two language models generate different reasoning paths (Path A and Path B) to solve the same complex problem. A reward model assesses each step, providing the log-probability of that step being 'correct'. The total reward for a path is the sum of these log-probabilities.
- Path A log-probabilities: [-0.4, -0.5, -0.3]
- Path B log-probabilities: [-0.1, -2.5, -0.2]
Based on this reward calculation method, which statement accurately compares the two paths?
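The answer options are not reproduced in this export, but the two totals can be computed directly (a sketch using the question's numbers):

```python
path_a = [-0.4, -0.5, -0.3]
path_b = [-0.1, -2.5, -0.2]
print(f"Path A: {sum(path_a):.1f}")  # Path A: -1.2
print(f"Path B: {sum(path_b):.1f}")  # Path B: -2.8
```

Path A scores higher even though Path B contains two individually stronger steps: under a summed log-probability reward, the single weak step at -2.5 dominates, because the sum corresponds to multiplying the per-step probabilities.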
Learn After
An AI model is being trained to generate step-by-step reasoning. The training process provides a reward for each complete reasoning path, calculated by summing the log-probabilities of each individual step being deemed 'correct'. A higher (i.e., less negative) total reward is better. Consider the following four reasoning paths generated by the model, along with the log-probability of correctness for each step. Which path will be most strongly reinforced during the training process?
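The four candidate paths are not reproduced in this export, so that data must stay missing; as a sketch, selection under this reward reduces to an argmax over summed log-probabilities. The path names and values below are hypothetical placeholders, not the original options:

```python
# Hypothetical candidate paths: placeholder values, NOT the original options.
paths = {
    "Path 1": [-0.3, -0.3, -0.3],
    "Path 2": [-0.1, -1.5, -0.1],
    "Path 3": [-0.2, -0.2, -0.2],
    "Path 4": [-0.2, -0.4, -0.6],
}
totals = {name: sum(lps) for name, lps in paths.items()}
best = max(totals, key=totals.get)   # the least negative total is reinforced most
for name, total in totals.items():
    print(f"{name}: {total:.1f}")
print("most strongly reinforced:", best)  # Path 3 (-0.6)
```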
Evaluating AI Reasoning Paths
Critique of Log-Probability Reward Signals