During the final training phase of a language model guided by human feedback, both a policy (the language model itself) and a value function are updated in tandem. Which of the following statements best analyzes the distinct roles and update mechanisms of these two components in this joint optimization process?
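To ground the question, here is a minimal sketch (with hypothetical toy numbers, not any specific library's API) of how the two update mechanisms differ in a PPO-style step: the policy is optimized with a clipped surrogate objective weighted by an advantage estimate, while the value function is regressed toward a return target with a squared-error loss.

```python
import math

def ppo_losses(logp_new, logp_old, advantage, value_pred, value_target, eps=0.2):
    """Return (policy_loss, value_loss) for a single token-level sample."""
    # Importance ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    # PPO clips the ratio to [1 - eps, 1 + eps] to limit the policy update.
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    # The policy maximizes the clipped surrogate, so we minimize its negation.
    policy_loss = -min(ratio * advantage, clipped * advantage)
    # The value function is updated by plain regression: squared error
    # between its prediction and the return estimate (e.g. a TD target).
    value_loss = (value_pred - value_target) ** 2
    return policy_loss, value_loss

# Toy numbers: the advantage could itself be a TD error,
# A = r + gamma * V(s') - V(s), as in the related note below.
pl, vl = ppo_losses(logp_new=-1.0, logp_old=-1.2, advantage=0.5,
                    value_pred=0.8, value_target=1.1)
```

The key contrast the question targets: the policy update is a constrained maximization of expected (human-preference-derived) reward, while the value update is an unconstrained regression whose only job is to supply better advantage estimates for the policy.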
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Value Function Loss Minimization in RLHF
PPO Objective Formula for LLM Training in RLHF
In the final stage of training a language model with feedback, a policy and a value function are optimized concurrently. Match each component to its primary optimization objective and its role in this process.
Value Model Update Frequency in RLHF
Advantage Function as TD Error in RLHF
Diagnosing Training Stagnation in Joint Optimization