Policy Gradient Utility for Sequence Generation
In the context of training sequence generation models with reinforcement learning, the utility function for an input-output pair is defined based on the policy gradient objective. It is calculated by summing the log-probabilities of generating each token in the output sequence, weighted by an advantage function A. The formula is:

U(x, y) = Σ_t log π_θ(y_t | x, y_<t) * A(x, y_<t, y_t)

Here, π_θ represents the large language model parameterized by θ. This utility measures the overall quality of the generated sequence according to the policy and the advantage estimates.
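For concreteness, here is a minimal sketch of this utility computation. The `sequence_utility` helper, the tensor shapes, and the toy values are illustrative assumptions, not part of any particular training framework.

```python
import torch
import torch.nn.functional as F

def sequence_utility(token_logits: torch.Tensor,
                     target_tokens: torch.Tensor,
                     advantages: torch.Tensor) -> torch.Tensor:
    """U(x, y) = sum_t log pi_theta(y_t | x, y_<t) * A(x, y_<t, y_t).

    token_logits:  (T, vocab_size) logits the model produced at each step.
    target_tokens: (T,) the token actually generated at each step.
    advantages:    (T,) advantage estimate for each step.
    """
    log_probs = F.log_softmax(token_logits, dim=-1)       # log pi_theta(. | x, y_<t)
    chosen = log_probs.gather(1, target_tokens.unsqueeze(1)).squeeze(1)
    return (chosen * advantages).sum()                    # U(x, y)

# Toy example: a 5-token vocabulary and a 3-token response.
logits = torch.randn(3, 5)
tokens = torch.tensor([2, 0, 4])
adv = torch.tensor([1.5, -0.3, 0.8])
print(sequence_utility(logits, tokens, adv))
```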

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Gradient Utility for Sequence Generation
A research team is training a language model to generate helpful and harmless dialogue responses. They define a utility function for a given input x and a generated response y as: U(x, y) = (0.8 * Helpfulness_Score) - (0.2 * Harmfulness_Score). The team's objective is to find the model parameters, θ, that maximize the average utility across a large dataset of interactions. Which of the following loss functions, L(θ), should the team minimize to achieve this objective?
A machine learning model is being trained with the objective of maximizing a specific utility function, U(x, y; θ), which measures the quality of its outputs. The loss function used for training is defined as L(θ) = E[(x,y)~D][U(x, y; θ)]. True or False: Minimizing this loss function L(θ) will successfully train the model to achieve its objective.
Diagnosing a Flawed Training Objective
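The True or False question above turns on the sign of the loss. A minimal sketch of the idea, using a toy one-parameter "utility" that is purely an illustrative assumption: minimizing L(θ) = -E[U] drives the parameter toward the utility maximum, whereas minimizing +E[U] would drive it away.

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)

def utility(theta):
    # Stand-in for E[U(x, y; theta)]; it peaks at theta = 2.
    return -(theta - 2.0) ** 2

for _ in range(100):
    optimizer.zero_grad()
    loss = -utility(theta)   # L(theta) = -E[U]; minimizing this maximizes utility
    loss.backward()
    optimizer.step()

print(theta.item())  # approaches 2.0, the utility-maximizing parameter
```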
Policy Gradient Utility for Sequence Generation
A language model is tasked with generating a sentence. After producing the partial sequence 'The cat sat on the', it computes the following probability distribution for the next word: {'mat': 0.7, 'chair': 0.2, 'roof': 0.1}. If we frame this generation process using reinforcement learning, how is this probability distribution correctly interpreted?
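As a small illustration of that framing (the sampling code is an assumption for demonstration, not part of the question): the prefix plays the role of the state s, the vocabulary entries are the actions, and the model's next-token distribution is the policy π(a|s).

```python
import random

state = "The cat sat on the"                       # s: the partial sequence so far
policy = {"mat": 0.7, "chair": 0.2, "roof": 0.1}   # pi(a|s): distribution over actions

# Acting under the policy means sampling the next token from this distribution.
action = random.choices(list(policy), weights=list(policy.values()), k=1)[0]
print(f"{state} {action}")
```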
Equivalence of Language Model and Policy
Conceptual Error in RL Fine-Tuning
The loss function for an actor's policy, π, is given by: L(θ) = -E[ Σ log π(a|s) * A(s,a) ], where A(s,a) is the advantage for taking action 'a' in state 's'. The training process works by minimizing this loss. If an agent takes an action that results in a large positive advantage, what is the direct effect of this event on the policy update?
An agent is being trained using an actor-critic method where the actor's loss is the negative of the expected sum of the log-probabilities of actions multiplied by their advantage values. During one training step, the agent selects an action that results in a large negative advantage. True or False: The optimization process, which aims to minimize the actor's loss, will update the policy to decrease the likelihood of selecting this action in the same state in the future.
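A minimal sketch of the sign behavior these two questions probe, using a single action step with made-up logits and advantages: under gradient descent on L = -log π(a|s) * A(s,a), a positive advantage raises the chosen action's probability and a negative advantage lowers it.

```python
import torch
import torch.nn.functional as F

logits = torch.zeros(3, requires_grad=True)  # 3 possible actions, uniform policy
action = 0

for advantage, label in [(4.0, "positive"), (-4.0, "negative")]:
    log_prob = F.log_softmax(logits, dim=-1)[action]
    loss = -log_prob * advantage                 # the actor's loss for this step
    grad, = torch.autograd.grad(loss, logits)
    new_logits = logits - 0.1 * grad             # one gradient-descent step
    old_p = F.softmax(logits, dim=-1)[action].item()
    new_p = F.softmax(new_logits, dim=-1)[action].item()
    print(f"{label} advantage: p(action) {old_p:.3f} -> {new_p:.3f}")
# Positive advantage: the action's probability rises; negative: it falls.
```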
Policy Gradient Utility for Sequence Generation
Policy Update Analysis
Learn After
A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:
- For token 'innovative': log-probability log π(y_t|...) = -3.0, advantage A(...) = +4.0
- For token 'effective': log-probability log π(y_t|...) = -1.2, advantage A(...) = +2.0
Based on the utility function U used in policy gradient methods, which is a sum of log π * A terms over the sequence, which token's selection results in a larger (i.e., less negative) contribution to the total utility U for the entire sequence at this specific step t?
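For reference, the per-step contributions log π * A under the stated values can be checked directly (a quick arithmetic sketch, not part of the original question):

```python
candidates = {
    "innovative": (-3.0, 4.0),   # (log-probability, advantage)
    "effective":  (-1.2, 2.0),
}
for token, (log_prob, advantage) in candidates.items():
    print(f"{token}: {log_prob} * {advantage} = {log_prob * advantage}")
# innovative: -3.0 * 4.0 = -12.0
# effective:  -1.2 * 2.0 = -2.4  <- larger (less negative) contribution to U
```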
Analyzing Policy Gradient Updates for Text Generation
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.
Basic A2C Formulation for LLMs