1Cademy - Diagnosing Training Stagnation in Joint Optimization

Learn Before

Joint Optimization of Policy and Value Functions in RLHF

Case Study

Analyze the following training scenario. Why might a highly accurate value function fail to produce a corresponding improvement in the policy's performance? In your explanation, distinguish between the optimization goal of the value function and the optimization goal of the policy.
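As a starting point for the distinction the prompt asks for, the sketch below contrasts the two objectives on a toy batch. It is an illustration under stated assumptions, not the scenario from the case study: the function names `value_loss` and `policy_surrogate`, the PPO-style clipped surrogate, and all numbers are hypothetical choices. The value function is fit by regression toward observed returns, while the policy is updated through advantages, so a critic that predicts returns perfectly can coincide with advantages that carry little or no learning signal.

```python
import numpy as np

def value_loss(values, returns):
    """Regression objective of the critic: mean squared error to returns."""
    return float(np.mean((values - returns) ** 2))

def policy_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate (assumed here), maximized by the policy."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))

# Hypothetical log-probabilities of four sampled responses,
# before and after a candidate policy update.
logp_old = np.log(np.array([0.25, 0.25, 0.25, 0.25]))
logp_new = np.log(np.array([0.30, 0.20, 0.28, 0.22]))

# Case A: every sample earns the same return and the critic predicts it exactly.
returns_a = np.array([1.0, 1.0, 1.0, 1.0])
values_a = np.array([1.0, 1.0, 1.0, 1.0])
adv_a = returns_a - values_a  # all zeros: nothing tells the policy what to prefer

# Case B: returns differ across samples, so advantages are non-zero.
returns_b = np.array([2.0, 0.0, 2.0, 0.0])
values_b = np.array([1.0, 1.0, 1.0, 1.0])  # still the correct mean return
adv_b = returns_b - values_b

loss_a = value_loss(values_a, returns_a)
obj_a = policy_surrogate(logp_new, logp_old, adv_a)
obj_b = policy_surrogate(logp_new, logp_old, adv_b)

print("case A value loss:", loss_a)
print("case A policy objective:", obj_a)
print("case B policy objective:", obj_b)
```

In case A the critic is as accurate as it can be, yet the policy objective is flat because the advantages vanish. In case B the same candidate update is rewarded only because the sampled returns differ. Low critic error measures how well returns are predicted under the current policy; it does not by itself measure whether the policy is moving toward higher reward.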

Updated 2025-10-10

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences