Activity (Process)

Application of Log-Probability-Based Reward in RLHF Policy Training

The cumulative reward score for a reasoning path is the sum, over its steps, of the log-probability that each step is correct. This score serves as the reward signal in the standard policy training phase of Reinforcement Learning from Human Feedback (RLHF).
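As a rough illustration of the summation described above, the sketch below computes a cumulative score from per-step correctness probabilities. The function name `cumulative_log_reward` and the example probabilities are hypothetical, not taken from the source. In practice the per-step probabilities would come from a trained reward model, and the resulting scalar would be passed to the policy optimizer.

```python
import math


def cumulative_log_reward(step_probs):
    """Sum the log-probabilities of correctness over the steps of a reasoning path.

    step_probs: iterable of floats in (0, 1], one per reasoning step.
    This is an assumed input format; a real verifier may expose logits
    or log-probabilities directly.
    """
    step_probs = list(step_probs)
    if not step_probs:
        raise ValueError("a reasoning path needs at least one step")
    total = 0.0
    for p in step_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        total += math.log(p)  # each term is <= 0, so the total is <= 0
    return total


# Hypothetical three-step reasoning path.
reward = cumulative_log_reward([0.9, 0.8, 0.95])
print(reward)
```

Because a sum of logarithms equals the logarithm of the product, this score is the log of the probability that every step is correct when the per-step probabilities are multiplied together. A single low-probability step therefore pulls the whole reward down sharply.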

Updated 2025-10-10

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences