1Cademy - A team is training a language model to act as a programming assistant that writes code to solve specific problems. Their training method involves running the code generated by the model. If the code executes without errors and produces the correct output for a set of predefined tests, the model receives a high reward. If the code fails to execute or produces the wrong output, it receives a low reward. The system does not evaluate the elegance, efficiency, or style of the code itself, only the fi

Learn Before

Outcome Reward Models

Multiple Choice

A team is training a language model to act as a programming assistant that writes code to solve specific problems. Their training method involves running the code generated by the model. If the code executes without errors and produces the correct output for a set of predefined tests, the model receives a high reward. If the code fails to execute or produces the wrong output, it receives a low reward. The system does not evaluate the elegance, efficiency, or style of the code itself, only the fi

Updated 2025-09-26

Contributors are:

Who are from:

Learn Before

Related