Learn Before
Analyzing Policy Gradient Updates for Text Generation
A language model is being fine-tuned with reinforcement learning to generate text with positive sentiment. The advantage function A is derived from a sentiment score. For a given input, the model generates two candidate sequences. Your task is to analyze which sequence provides a more favorable outcome according to the policy gradient utility function. Calculate the total utility U for each sequence and explain which one the objective function would favor and why.
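The utility function referred to here is the policy gradient objective, a sum of log π(y_t | ...) · A(...) terms over the tokens of a sequence. A minimal sketch of that computation, using hypothetical per-token log-probabilities and advantages (the actual values would come from the model and the sentiment scorer):

```python
def sequence_utility(log_probs, advantages):
    """Sum of log pi(y_t | ...) * A(...) over all steps of a sequence."""
    return sum(lp * a for lp, a in zip(log_probs, advantages))

# Hypothetical values for two candidate sequences (illustration only).
u_a = sequence_utility(log_probs=[-0.5, -1.0], advantages=[3.0, 2.0])  # -1.5 + -2.0 = -3.5
u_b = sequence_utility(log_probs=[-2.0, -1.5], advantages=[2.0, 1.0])  # -4.0 + -1.5 = -5.5

# The objective favors the sequence with the larger (less negative) utility.
print(u_a, u_b)  # -3.5 -5.5  -> sequence A is favored
```

With these made-up numbers, sequence A pairs higher-probability tokens with larger advantages, so its total utility is less negative and the objective favors it.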
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:

For token 'innovative':
- Log-probability log π(y_t|...): -3.0
- Advantage A(...): +4.0

For token 'effective':
- Log-probability log π(y_t|...): -1.2
- Advantage A(...): +2.0

Based on the utility function U used in policy gradient methods, which is a sum of log π * A terms over the sequence, which token's selection results in a larger (i.e., less negative) contribution to the total utility U for the entire sequence at this specific step t?
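Using the numbers given above, the per-step contribution log π(y_t|...) * A(...) can be checked directly:

```python
# Per-step contribution to the utility U is log pi(y_t | ...) * A(...).
contrib_innovative = -3.0 * 4.0  # = -12.0
contrib_effective = -1.2 * 2.0   # = -2.4

# 'effective' contributes the larger (less negative) term, so it yields
# a higher total utility U at this step, despite its smaller advantage.
print(contrib_innovative, contrib_effective)
```

Note that the lower log-probability of 'innovative' is multiplied by a larger advantage, yet the product is still far more negative, so the objective favors 'effective' here.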
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.

Basic A2C Formulation for LLMs