Learn Before

Classification of Reward Models for LLM Reasoning

Multiple Choice

A team is training a language model to solve multi-step logic puzzles. Their training system automatically reviews each line of the model's generated reasoning. If a line represents a valid deductive step, it receives a positive score. If a line contains a logical fallacy or contradicts a previous statement, it receives a negative score, and the evaluation stops. The total score for the entire reasoning path is then used to update the model. Which classification best describes this type of feedb
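The scoring procedure in the question can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's actual system: the function names, the reward values of +1 and -1, and the toy contradiction checker are all hypothetical and stand in for whatever automatic reviewer the team uses.

```python
from typing import Callable, List

# Hypothetical signature: a checker takes the previously accepted steps and the
# current step, and returns True if the current step is valid.
StepChecker = Callable[[List[str], str], bool]


def score_reasoning_path(
    steps: List[str],
    is_valid_step: StepChecker,
    step_reward: float = 1.0,     # assumed value for a valid step
    error_penalty: float = -1.0,  # assumed value for an invalid step
) -> float:
    """Score each reasoning step in order and stop at the first invalid one."""
    total = 0.0
    accepted: List[str] = []
    for step in steps:
        if is_valid_step(accepted, step):
            total += step_reward
            accepted.append(step)
        else:
            total += error_penalty
            break  # evaluation stops at the first fallacy or contradiction
    return total


def toy_checker(previous: List[str], step: str) -> bool:
    """Toy stand-in for a real reviewer: reject a step that directly negates
    an earlier statement ("X" versus "not X")."""
    negation = step[4:] if step.startswith("not ") else "not " + step
    return negation not in previous


if __name__ == "__main__":
    path = ["A", "B", "not A", "C"]
    # Two valid steps, then a contradiction; "C" is never evaluated.
    print(score_reasoning_path(path, toy_checker))
```

The returned total is the single number that would be passed to the model update. Note that it is built from per-line judgments rather than from a check of the final answer alone.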

Updated 2025-09-26
