Activity (Process)

Application of Segment-Based Total Reward in Policy Training

The total reward score, r(x,y)r(\mathbf{x}, \mathbf{y}), which is aggregated from the scores of individual segments of a generated output, is utilized as the primary reward signal in the standard training process for the policy model.

0

1

Updated 2025-10-07

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences