Learn Before
Target Model (Policy Model) in RLHF
In the RLHF framework, the Target Model, also known as the policy model, is the Large Language Model being actively trained. Its policy, denoted as $\pi_\theta$ and formally defined as the probability distribution $\pi_\theta(y_t \mid x, y_{<t})$, governs the generation of the next token $y_t$ based on the prompt $x$ and the previously generated tokens $y_{<t}$. The model's parameters, $\theta$, are updated during training under the guidance of both the reward and value models.
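A minimal sketch of this role in code may help. It assumes a Hugging Face causal LM; the model name ("gpt2"), the advantage value, and the single-step REINFORCE-style update are all illustrative simplifications (production RLHF typically uses PPO with advantages computed from the reward and value models over full responses):

```python
# Sketch: the policy model pi_theta produces a next-token distribution,
# samples from it, and has its parameters updated by a reward-weighted
# policy-gradient step. Model name and advantage value are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # assumed model
policy = AutoModelForCausalLM.from_pretrained("gpt2")      # pi_theta

prompt = "The best way to learn a new skill is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# pi_theta(y_t | x, y_<t): a distribution over the vocabulary for the
# next token, conditioned on the current context.
logits = policy(input_ids).logits[:, -1, :]                # (1, vocab_size)
next_token_dist = F.softmax(logits, dim=-1)

# Generation: sample the next token from the policy's distribution.
next_token = torch.multinomial(next_token_dist, num_samples=1)

# Training: a toy REINFORCE-style update, where `advantage` stands in
# for the signal derived from the reward and value models.
advantage = torch.tensor(0.7)                              # illustrative
log_prob = torch.log(next_token_dist.gather(1, next_token))
loss = -(advantage * log_prob).mean()
loss.backward()                                            # grads w.r.t. theta
```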
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Architecture and Function of the RLHF Value Model
Reference Policy Definition in RLHF
Architecture and Function of the RLHF Reward Model
A development team is building a system to align a large language model using reinforcement learning from human feedback. Their setup includes a target model for text generation, a reference model, a reward model to score outputs based on human preferences, and a value model to predict future rewards. For computational efficiency, they decide to build the reward model using a Convolutional Neural Network (CNN) and the value model using a Recurrent Neural Network (RNN), while keeping the target and reference models as Transformer decoders. What is the most significant architectural inconsistency in this design compared to a standard implementation?
LLM as the Agent in RLHF
An alignment process for a large language model uses a system composed of four distinct models, all sharing a common underlying architecture. Match each model component with its primary role in this system.
Architectural Consistency in Feedback-Based LLM Alignment
In a typical system for aligning a language model with human feedback, it is common practice to use a Transformer-based architecture for the text-generating models, while employing simpler, non-Transformer architectures for the reward and value models to reduce computational overhead.
Learn After
An engineering team is refining a large language model. During one step of the process, the model is given the start of a sentence, 'The best way to learn a new skill is'. The model then calculates a probability for every word in its vocabulary to be the next word and uses this distribution to generate a complete sentence. The model's internal parameters are then updated based on a separate quality assessment of the generated sentence. Which part of this process best describes the primary role of the model being actively trained (the 'policy model')?
Mechanism of Policy Model Refinement
Analyzing Policy Model Behavior