1Cademy - Policy Improvement Decision

Learn Before

Advantage Function Estimation in RLHF

Case Study

Policy Improvement Decision

Given the following scenario and assuming a discount factor of 0.9, calculate the advantage for both Option A and Option B. Based on your calculations, which action should the learning algorithm be encouraged to take, and why?
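
The comparison asked for here is usually made with the one-step temporal difference estimate of the advantage. As a sketch, assume each option supplies an immediate reward $r$, a value estimate for the current state $V(s)$, and a value estimate for the resulting state $V(s')$. These symbols are notation introduced for this note, not taken from the scenario.

```latex
\hat{A}(s, a) = r + \gamma V(s') - V(s), \qquad \gamma = 0.9
```

Under that assumption, a positive estimate means the action did better than the current value estimate predicted. The algorithm should therefore raise the probability of the option with the larger advantage and lower it for any option whose advantage is negative.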

Updated 2025-10-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Value Function Loss Minimization in RLHF
A language model is being trained to generate text. At a certain step, it considers generating the next token. The system has the following estimates:
- The value (expected future rewards) of the current state is 1.2.
- After generating a specific token, the immediate reward received is +0.5.
- The value of the new state after generating the token is 1.0.
- The discount factor for future rewards is 0.9.
Based on the standard temporal difference method for estimating the advantage, what is the
Policy Improvement Decision
Interpreting the Advantage Function
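
For the token-generation estimates listed above (current state value 1.2, immediate reward +0.5, new state value 1.0, discount factor 0.9), the arithmetic can be sketched in Python. This assumes the truncated question is asking for the one-step temporal difference advantage. The function name `td_advantage` and its parameter names are hypothetical, chosen only for illustration.

```python
def td_advantage(reward, value_next, value_current, gamma):
    """One-step TD advantage estimate: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_current


# Estimates taken from the token-generation question above.
advantage = td_advantage(
    reward=0.5,         # immediate reward for generating the token
    value_next=1.0,     # value of the new state
    value_current=1.2,  # value of the current state
    gamma=0.9,          # discount factor
)

# 0.5 + 0.9 * 1.0 - 1.2 = 0.2 (rounded to hide floating-point noise)
print(round(advantage, 4))
```

A positive result such as this one indicates the token did somewhat better than the current value estimate predicted, so under this estimate its probability would be nudged upward.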
