Case Study

Analyzing a Policy Update Step

A language model is being fine-tuned with reinforcement learning. During one update step, the model samples two different continuations of the same initial text. Continuation A receives an advantage of +2.0, while Continuation B receives an advantage of -1.5. The objective function maximizes the sum, over sampled actions, of each action's log probability multiplied by its advantage. Describe the effect of this single update step on the model's tendency to generate each continuation in the future, and explain your reasoning for both continuations.
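The scenario can be made concrete with a toy calculation. The sketch below is not an actual language-model update: it assumes, purely for illustration, that the policy is a softmax over three logits (Continuation A, Continuation B, and a hypothetical "other" bucket standing in for every unsampled continuation), that all logits start at zero, and that one gradient-ascent step is taken with a made-up learning rate of 0.1. The names `softmax` and `policy_gradient_step` are illustrative helpers, not part of any library.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, advantages, lr=0.1):
    """One gradient-ascent step on J = sum_i advantage_i * log pi_i.

    For a softmax policy, d(log pi_i)/d(z_k) = 1[i == k] - pi_k,
    so dJ/dz_k = advantage_k - pi_k * sum(advantages).
    """
    probs = softmax(logits)
    total_adv = sum(advantages)
    grads = [adv - p * total_adv for adv, p in zip(advantages, probs)]
    return [z + lr * g for z, g in zip(logits, grads)]

# Assumed setup: equal starting logits; "other" was not sampled,
# so it contributes an advantage of 0 to the objective.
logits = [0.0, 0.0, 0.0]
advantages = [2.0, -1.5, 0.0]

before = softmax(logits)
after = softmax(policy_gradient_step(logits, advantages))

for name, b, a in zip(["A", "B", "other"], before, after):
    print(f"{name}: {b:.3f} -> {a:.3f}")
```

Under these assumptions the probability of Continuation A rises and that of Continuation B falls after the step. The positive advantage pushes A's log probability up, the negative advantage pushes B's down, and each push is scaled by the magnitude of the corresponding advantage.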

Updated 2025-10-07

Tags

Ch.4 Alignment - Foundations of Large Language Models
