Case Study

Analyzing Policy Gradient Updates for Text Generation

A language model is being fine-tuned with reinforcement learning to generate text with positive sentiment. The advantage function A is derived from a sentiment score. For a given input, the model generates two candidate sequences. Your task is to analyze which sequence provides a more favorable outcome according to the policy gradient utility function. Calculate the total utility U for each sequence and explain which one the objective function would favor and why.
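The calculation asked for above can be sketched in a few lines. The case study does not spell out the utility here, so this sketch assumes the common policy gradient form, in which a sequence's utility is the sum over its tokens of the advantage times the log-probability the model assigns to that token: U = sum_t A_t * log pi(y_t | x, y_<t). The token probabilities and advantages below are hypothetical placeholders, not values from the case study, and `sequence_utility` is an illustrative helper name.

```python
import math


def sequence_utility(token_probs, advantages):
    """Return sum_t A_t * log(p_t) for one generated sequence.

    token_probs: probability the policy assigned to each generated token.
    advantages:  per-token advantage values (here assumed to come from a
                 sentiment score), one per token.
    """
    if len(token_probs) != len(advantages):
        raise ValueError("need exactly one advantage per token")
    return sum(a * math.log(p) for p, a in zip(token_probs, advantages))


# Hypothetical inputs for two candidate sequences (NOT from the case study).
seq_a_probs, seq_a_adv = [0.5, 0.4], [2.0, 1.0]
seq_b_probs, seq_b_adv = [0.5, 0.4], [0.5, 0.5]

u_a = sequence_utility(seq_a_probs, seq_a_adv)
u_b = sequence_utility(seq_b_probs, seq_b_adv)

print(f"U(A) = {u_a:.4f}")
print(f"U(B) = {u_b:.4f}")
print("larger utility:", "A" if u_a > u_b else "B")
```

Because log-probabilities are never positive, a token with positive advantage contributes a non-positive term to U, and a token with negative advantage contributes a non-negative one. It is therefore worth checking the sign of each term when explaining which sequence the objective favors, rather than reading the totals alone.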

Updated 2025-10-04

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science