Formula

Policy Gradient Utility for Sequence Generation

In the context of training sequence generation models with reinforcement learning, the utility function $U$ for an input-output pair $(\mathbf{x}, \mathbf{y})$ is defined based on the policy gradient objective. It is calculated by summing the log-probabilities of generating each token $y_t$ in the output sequence, weighted by an advantage function $A$. The formula is:

$$U(\mathbf{x}, \mathbf{y}; \theta) = \sum_{t=1}^{T} \log \pi_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}) \, A(\mathbf{x}, \mathbf{y}_{<t}, y_t)$$

Here, $\pi_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}) = \mathrm{Pr}_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$ denotes the large language model parameterized by $\theta$. This utility measures the overall quality of the generated sequence $\mathbf{y}$ according to the policy $\pi_{\theta}$ and the advantage estimates.
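The computation is a simple dot product between per-token log-probabilities and per-token advantages. Below is a minimal sketch in PyTorch, assuming the log-probabilities $\log \pi_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$ and the advantage estimates $A(\mathbf{x}, \mathbf{y}_{<t}, y_t)$ have already been computed and are available as tensors; the function name and tensor shapes are illustrative, not taken from the source.

```python
import torch

def policy_gradient_utility(log_probs: torch.Tensor,
                            advantages: torch.Tensor) -> torch.Tensor:
    """Compute U(x, y; theta) = sum_t log pi_theta(y_t | x, y_<t) * A(x, y_<t, y_t).

    Args:
        log_probs:  shape (T,), log-probability of each generated token y_t
                    under the policy pi_theta (the language model).
        advantages: shape (T,), advantage estimate A(x, y_<t, y_t) at each step.

    Returns:
        A scalar tensor: the utility of the whole sequence y.
    """
    # Weight each token's log-probability by its advantage, then sum over t.
    return (log_probs * advantages).sum()

# Toy usage with made-up numbers (illustrative only):
log_probs = torch.log(torch.tensor([0.5, 0.8, 0.9]))   # T = 3 tokens
advantages = torch.tensor([1.2, -0.3, 0.7])
print(policy_gradient_utility(log_probs, advantages))
```

Because the utility is differentiable in $\theta$ through the log-probabilities, maximizing it by gradient ascent increases the probability of tokens with positive advantage and decreases the probability of tokens with negative advantage.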
