Formula

Total Reward as Sum of Segment-Based Scores

The cumulative reward score for an entire output token sequence, written $r(\mathbf{x}, \mathbf{y})$, is the sum of the individual reward scores of its segments:

$$r(\mathbf{x}, \mathbf{y}) = \sum_{k=1}^{n_s} r(\mathbf{x}, \mathbf{y}, \bar{\mathbf{y}}_k)$$

Here $n_s$ is the total number of segments the sequence is divided into, and $r(\mathbf{x}, \mathbf{y}, \bar{\mathbf{y}}_k)$ is the reward computed for the $k$-th segment $\bar{\mathbf{y}}_k$. This total score is then used to train the policy model in the usual way.
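The summation over segments can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `total_reward`, `split_into_segments`, and `toy_segment_reward` are hypothetical, the fixed-length segmentation is an arbitrary choice, and the toy scoring rule stands in for a learned segment-level reward model, which the source does not specify.

```python
from typing import Callable, List, Sequence

Tokens = Sequence[str]


def split_into_segments(y: Tokens, segment_len: int) -> List[Tokens]:
    """Assumed segmentation: cut y into consecutive chunks of segment_len tokens."""
    return [y[i:i + segment_len] for i in range(0, len(y), segment_len)]


def total_reward(
    x: Tokens,
    y: Tokens,
    segments: List[Tokens],
    segment_reward: Callable[[Tokens, Tokens, Tokens], float],
) -> float:
    """r(x, y) = sum over k = 1..n_s of r(x, y, y_bar_k)."""
    return sum(segment_reward(x, y, seg) for seg in segments)


def toy_segment_reward(x: Tokens, y: Tokens, segment: Tokens) -> float:
    """Hypothetical stand-in for a reward model:
    +1 for every segment token that also occurs in the input x."""
    prompt_vocab = set(x)
    return float(sum(tok in prompt_vocab for tok in segment))


x = ["what", "is", "two", "plus", "two"]
y = ["two", "plus", "two", "is", "four", "."]

segments = split_into_segments(y, segment_len=2)  # n_s = 3 segments
score = total_reward(x, y, segments, toy_segment_reward)
print(len(segments), score)
```

In practice `segment_reward` would be replaced by whatever segment-level reward model is in use; the aggregation step itself stays a plain sum.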

Updated 2026-05-03

Tags

Ch.4 Alignment - Foundations of Large Language Models
