1Cademy - An AI model is being trained to generate step-by-step reasoning. The training process provides a reward for each complete reasoning path, calculated by summing the log-probabilities of each individual step being deemed correct. A higher (i.e., less negative) total reward is better. Consider the following four reasoning paths generated by the model, along with the log-probability of correctness for each step. Which path will be most strongly reinforced during the training process?

Learn Before

Application of Log-Probability-Based Reward in RLHF Policy Training

Multiple Choice

An AI model is being trained to generate step-by-step reasoning. The training process provides a reward for each complete reasoning path, calculated by summing the log-probabilities of each individual step being deemed 'correct'. A higher (i.e., less negative) total reward is better. Consider the following four reasoning paths generated by the model, along with the log-probability of correctness for each step. Which path will be most strongly reinforced during the training process?

Updated 2025-09-26

Contributors are:

Who are from:

Learn Before

Related