1Cademy - In the final stage of training a language model with feedback, a policy and a value function are optimized concurrently. Match each component to its primary optimization objective and its role in this process.

Learn Before

Joint Optimization of Policy and Value Functions in RLHF

Matching

In the final stage of training a language model with feedback, a policy and a value function are optimized concurrently. Match each component to its primary optimization objective and its role in this process.
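As a point of reference for the two objectives, the following is a minimal sketch rather than a definitive implementation. It assumes a PPO-style clipped surrogate for the policy, a mean-squared-error loss for the value function, and a one-step TD error as the advantage. The function names, the clipping constant, and the toy numbers are all hypothetical and are not taken from the course material.

```python
import math

def td_advantages(rewards, values, gamma=1.0):
    """One-step TD error per step: r_t + gamma * V(s_{t+1}) - V(s_t).
    Assumes the value after the final step is 0."""
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(reward + gamma * next_value - values[t])
    return advantages

def policy_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective, averaged over steps (the policy MAXIMIZES this).
    The PPO-style clipping is an assumption of this sketch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def value_loss(values, targets):
    """Mean squared error between predicted values and targets
    (the value function MINIMIZES this)."""
    return sum((v - g) ** 2 for v, g in zip(values, targets)) / len(values)

# Toy rollout (hypothetical numbers): reward arrives only at the last step.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.8]
advantages = td_advantages(rewards, values)

# TD targets for the value function: advantage plus the current estimate.
targets = [a + v for a, v in zip(advantages, values)]

# With identical old and new log-probabilities the ratio is 1, so the
# objective reduces to the mean advantage.
logp = [-1.0, -2.0, -0.5]
objective = policy_objective(logp, logp, advantages)
loss = value_loss(values, targets)
```

In this sketch the value function's estimates feed the advantage that weights the policy update, while the value function itself is fit by regression toward its targets. That is one common way the two components are coupled during joint optimization.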

Updated 2025-10-05

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Value Function Loss Minimization in RLHF
During the final training phase of a language model guided by human feedback, both a policy (the language model itself) and a value function are updated in tandem. Which of the following statements best analyzes the distinct roles and update mechanisms of these two components in this joint optimization process?
Value Model Update Frequency in RLHF
Advantage Function as TD Error in RLHF
Diagnosing Training Stagnation in Joint Optimization
