Formula

Policy Gradient Objective Function for RL Fine-Tuning

The objective function for the reinforcement learning fine-tuning phase of RLHF is based on the policy gradient method. The goal is to update the language model's policy parameters, $\theta$, to maximize the expected advantage of its actions. For a given trajectory $\tau$, the objective function $U$ is defined as:

$$U(\tau; \theta) = \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) \, A(s_t, a_t)$$

Here, $\pi_{\theta}(a_t \mid s_t)$ is the probability of the policy taking action $a_t$ in state $s_t$, and $A(s_t, a_t)$ is the advantage function, which measures how much better that action is than the average action in that state.
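As a minimal sketch, the objective for one trajectory can be computed directly from per-step log-probabilities and advantage estimates. The function name and the example inputs below are hypothetical, chosen purely for illustration; in practice the log-probabilities come from the language model's policy and the advantages from a learned value/reward model.

```python
import math

def policy_gradient_objective(log_probs, advantages):
    """U(tau; theta) = sum over t of log pi_theta(a_t | s_t) * A(s_t, a_t)
    for a single trajectory. Both arguments are per-timestep sequences
    of equal length (hypothetical inputs for illustration)."""
    assert len(log_probs) == len(advantages)
    return sum(lp * adv for lp, adv in zip(log_probs, advantages))

# Illustrative trajectory of length 3: action probabilities 0.5, 0.25, 0.8
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
advantages = [1.0, -0.5, 2.0]  # made-up advantage estimates
U = policy_gradient_objective(log_probs, advantages)
```

Gradient ascent on $U$ increases the log-probability of actions with positive advantage and decreases it for actions with negative advantage, which is exactly the weighting the sum expresses.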


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences