Critique of Log-Probability Reward Signals
A common method for training an AI to produce multi-step reasoning is to use a reward signal calculated by summing the log-probabilities of each step being correct. Evaluate this reward mechanism. In your response, discuss at least one significant advantage and one potential disadvantage or limitation of this approach compared to simpler, binary (correct/incorrect) reward schemes for the entire reasoning path.
0
1
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An AI model is being trained to generate step-by-step reasoning. The training process provides a reward for each complete reasoning path, calculated by summing the log-probabilities of each individual step being deemed 'correct'. A higher (i.e., less negative) total reward is better. Consider the following four reasoning paths generated by the model, along with the log-probability of correctness for each step. Which path will be most strongly reinforced during the training process?
Evaluating AI Reasoning Paths
Critique of Log-Probability Reward Signals