Learn Before
  • Joint Optimization of Policy and Value Functions in RLHF

Concept

Value Model Update Frequency in RLHF

During the joint optimization phase of RLHF, the value model is updated at each token position within a generated output sequence, rather than only at the end of the sequence. These per-token, temporal-difference-style updates give the value model a dense learning signal at every generation step, instead of a single signal per completed sequence.
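To make the per-token update concrete, below is a minimal PyTorch sketch, not the course's or any particular library's implementation. The names ValueHead, per_token_value_loss, and the choice of gamma are illustrative assumptions. Each position t regresses its value estimate toward a TD(0) target r_t + gamma * V(s_{t+1}), so a single gradient step updates the value model at every token of the generated sequence rather than only at the final one.

```python
# A minimal sketch (illustrative names, not a specific library's API) of
# per-token value-model updates during RLHF joint optimization.

import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps each token's hidden state to a scalar value estimate."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) -> values: (batch, seq_len)
        return self.linear(hidden_states).squeeze(-1)

def per_token_value_loss(values, rewards, gamma=1.0):
    """TD(0)-style regression: V(s_t) is pulled toward r_t + gamma * V(s_{t+1})
    at every token position, not just once at the end of the sequence."""
    # Bootstrap from the next token's value; the terminal next-value is 0.
    next_values = torch.cat(
        [values[:, 1:], torch.zeros_like(values[:, :1])], dim=1
    ).detach()
    td_targets = rewards + gamma * next_values
    return ((values - td_targets) ** 2).mean()

# Toy usage: a batch of 2 sequences, 5 generated tokens each.
batch, seq_len, hidden = 2, 5, 16
value_head = ValueHead(hidden)
hidden_states = torch.randn(batch, seq_len, hidden)  # from the policy LM
rewards = torch.zeros(batch, seq_len)                # per-token rewards
rewards[:, -1] = 1.0                                 # sequence-level reward arrives at the end
values = value_head(hidden_states)                   # one estimate per token
loss = per_token_value_loss(values, rewards)
loss.backward()                                      # gradients flow to every token position
```

In a full joint-optimization loop this value loss would be combined with the PPO policy loss; only the value side is shown here.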

Updated 2025-10-07

Contributors: Gemini AI (Google)

References

  • Foundations of Large Language Models Course

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Value Function Loss Minimization in RLHF

  • PPO Objective Formula for LLM Training in RLHF

  • During the final training phase of a language model guided by human feedback, both a policy (the language model itself) and a value function are updated in tandem. Which of the following statements best analyzes the distinct roles and update mechanisms of these two components in this joint optimization process?

  • In the final stage of training a language model with feedback, a policy and a value function are optimized concurrently. Match each component to its primary optimization objective and its role in this process.

  • Advantage Function as TD Error in RLHF

  • Diagnosing Training Stagnation in Joint Optimization

Learn After
  • In a reinforcement learning process for training a language model, a 'value model' is used to estimate the expected future reward from any given point in a generated text sequence. What is the primary analytical reason for updating this value model's parameters after each token is generated, rather than only once at the end of the complete sequence?

  • Diagnosing Inefficient Language Model Training

  • During the iterative process of training a language model using human feedback, the component responsible for estimating future rewards (the 'value model') is only updated once, after an entire sequence of text has been fully generated.
