
In RLHF, the advantage function measures how much better taking action $$a_t$$ in state $$s_t$$ is than the value function's estimate for that state. It is commonly estimated with the Temporal Difference (TD) error, and this estimate is used in both the policy and value function updates. The TD error takes the immediate reward $$r_t$$, adds the discounted estimated value of the next state $$\gamma V(s_{t+1})$$, and subtracts the estimated value of the current state $$V(s_t)$$: $$A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$. The state value function $$V(s_t)$$ is typically trained concurrently with the policy, using rewards from the reward model.
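The formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `td_advantages` and the convention that `values` holds one more entry than `rewards` (the value of the state after the last token) are assumptions made for the example.

```python
from typing import List


def td_advantages(rewards: List[float], values: List[float], gamma: float = 1.0) -> List[float]:
    """One-step TD-error advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    Assumed layout (hypothetical, for illustration only):
    rewards[t] is r_t and values[t] is V(s_t), with values holding one
    extra trailing entry for the state reached after the last action.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [
        r + gamma * v_next - v
        for r, v, v_next in zip(rewards, values[:-1], values[1:])
    ]


if __name__ == "__main__":
    # Made-up numbers: two steps, terminal state value set to 0.
    print(td_advantages([0.5, 0.0], [0.25, 1.0, 0.0], gamma=0.5))
```

A positive entry means the step turned out better than the value function predicted for that state; a negative entry means it turned out worse.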

Advantage Function as TD Error in RLHF

The value function, parameterized by $$\omega$$, is trained alongside the policy to estimate the expected future reward from a given state. Its parameters are updated by minimizing the Mean Squared Error (MSE) between the predicted state value, $$V_\omega(\mathbf{x},y_{<t})$$, and the computed return (the TD target). The computed return is the sum of the immediate reward, $$r_t$$, and the discounted value of the next state, $$\gamma V_\omega(\mathbf{x},y_{<t+1})$$. The squared error is summed over the prompts $$\mathbf{x}$$ in a dataset $$\mathcal{D}$$ and over the token positions $$t = 1, \dots, T$$, then normalized by $$M$$: $$\min_{\omega} \frac{1}{M} \sum_{\mathbf{x} \in \mathcal{D}} \sum_{t=1}^{T} \left( r_t + \gamma V_\omega(\mathbf{x},y_{<t+1}) - V_\omega(\mathbf{x},y_{<t}) \right)^2$$
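As a rough sketch, the value of this objective can be computed as follows. The function name `value_td_loss` and the data layout are hypothetical, and the code assumes that $$M$$ is the total number of token positions in the batch, which the text above does not state explicitly. It only evaluates the loss; the gradient step on $$\omega$$ is left out.

```python
from typing import List, Tuple

# One sequence = (rewards, values), where values has one more entry than
# rewards: values[t] is V(x, y_<t) and values[t + 1] is V(x, y_<t+1).
Sequence = Tuple[List[float], List[float]]


def value_td_loss(batch: List[Sequence], gamma: float = 1.0) -> float:
    """Mean squared TD error over all token positions in the batch.

    Assumption: the normalizer M is the total count of token positions.
    """
    squared_errors = []
    for rewards, values in batch:
        if len(values) != len(rewards) + 1:
            raise ValueError("values must have exactly one more entry than rewards")
        for r, v, v_next in zip(rewards, values[:-1], values[1:]):
            td_error = r + gamma * v_next - v
            squared_errors.append(td_error ** 2)
    if not squared_errors:
        raise ValueError("batch contains no token positions")
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    # Made-up numbers: one sequence with two token positions.
    print(value_td_loss([([0.5, 0.0], [0.25, 1.0, 0.0])], gamma=0.5))
```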

Value Function Loss Minimization in RLHF

Based on the scenario described, calculate the advantage value for generating the token 'sunlight'. Then, explain what the sign (positive or negative) of this calculated advantage value implies about the action taken at this step.

Analyzing a Single Training Step in Language Model Fine-Tuning

During the fine-tuning of a language model, at a specific step $$t$$, the model has generated the sequence $$y_{<t}$$ based on an initial prompt $$\mathbf{x}$$. The value function estimates the value of this state, $$V(\mathbf{x}, y_{<t})$$, to be 0.5. The model then generates the next token, $$y_t$$, and receives an immediate reward $$r_t$$ of 0.1 from a reward model. The value function's estimate for the new state, $$V(\mathbf{x}, y_{<t+1})$$, is 0.8. Assuming a discount factor $$\gamma$$ of 0.9, calculate the advantage $$A_t$$ for this step. Show your calculation.
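One way to work through the numbers, substituting them into the TD-error formula for the advantage:

$$A_t = r_t + \gamma V(\mathbf{x}, y_{<t+1}) - V(\mathbf{x}, y_{<t}) = 0.1 + 0.9 \times 0.8 - 0.5 = 0.1 + 0.72 - 0.5 = 0.32$$

The result is positive, so generating this token led to a better outcome than the value function had estimated for the previous state.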

Calculating the Advantage for a Single Token Generation

During the fine-tuning of a large language model, at a specific generation step $$t$$, the calculated advantage value is found to be significantly negative ($$A_t < 0$$). What is the most accurate interpretation of this outcome?
