Policy Gradient Objective Function for RL Fine-Tuning
The objective function for the reinforcement learning fine-tuning phase of RLHF is based on the policy gradient method. The goal is to update the language model's policy parameters, $\theta$, to maximize the expected advantage of its actions. For a given trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$, the objective function is defined as:

$$J(\theta) = \sum_{t} \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)$$

Here, $\pi_\theta(a_t \mid s_t)$ is the probability of the policy taking action $a_t$ in state $s_t$, and $A(s_t, a_t)$ is the advantage function, which measures how much better that action is compared to the average action in that state.
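To make the update concrete, here is a minimal PyTorch sketch of one policy-gradient step on this objective. It is an illustration under stated assumptions, not the full RLHF pipeline: the small logits table standing in for the language model, the trajectory length, and the hard-coded advantage values are all hypothetical.

```python
import torch

# Toy policy-gradient step. The "policy" is a bare logits table rather
# than a language model, and the advantage values are made up; only the
# construction of the objective mirrors the formula above.
torch.manual_seed(0)

vocab_size = 8   # toy action space (e.g., next-token choices)
seq_len = 5      # length of the sampled trajectory tau

# Stand-in policy parameters: one row of logits per step of the trajectory.
logits = torch.randn(seq_len, vocab_size, requires_grad=True)

# Sample actions a_t from the current policy pi_theta(. | s_t).
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()

# Hypothetical advantage estimates A(s_t, a_t), e.g., a reward-model score
# minus a baseline. Treated as constants with respect to theta.
advantages = torch.tensor([0.5, -0.2, 1.3, 0.1, -0.7])

# J(theta) = sum_t log pi_theta(a_t | s_t) * A(s_t, a_t)
log_probs = dist.log_prob(actions)
objective = (log_probs * advantages).sum()

# Maximize J by gradient ascent on theta.
objective.backward()
with torch.no_grad():
    logits += 0.1 * logits.grad  # one ascent step on the toy parameters
```

Note the direction of the update: an action with a positive advantage has its log-probability (and hence its probability) pushed up, while a negative advantage pushes it down.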
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Continuous Supervision from the RLHF Reward Model
A language model is being aligned using feedback from human preferences. A separate model is first trained to distinguish between pairs of model-generated responses, learning to identify the better one in each pair. This model is then used to assign a single numerical value to each new response generated by the language model, guiding its optimization. What is the most significant advantage of this two-stage process?
During the reinforcement learning phase of model alignment, the reward model's primary function is to output a binary classification for each generated response, labeling it as either 'preferred' or 'not preferred'.
The Reward Model's Functional Shift
Policy Gradient Objective Function for RL Fine-Tuning
Learn After
During a reinforcement learning update step for a language model, a generated response (action) is evaluated and receives a significantly positive advantage score. Based on the structure of the policy gradient objective function, which aims to maximize the sum of log probabilities multiplied by advantages, what is the most direct consequence for the model's parameters?
Analyzing a Policy Update Step
Analyzing Conflicting Signals in RL Fine-Tuning