Learn Before
RLHF Component Interaction during Token Generation
In the Reinforcement Learning from Human Feedback (RLHF) process, several components interact at each step of text generation. The input x together with the partially generated sequence y_{<t} forms the current state s_t. The policy, typically a Large Language Model (LLM), takes this state and produces an action a_t: the next token y_t. This state-action pair is then evaluated by a reward model R(s_t, a_t) and by the value functions V(s_t) and Q(s_t, a_t), and the resulting feedback is used to optimize the policy.
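A minimal sketch of one such generation step, assuming toy stand-in components: the names `policy_sample`, `reward_model`, `value_fn`, and `q_value_fn` are hypothetical placeholders, not the API of any real RLHF library.

```python
import random

# Toy vocabulary; a real policy would score the model's full token vocabulary.
VOCAB = ["doing", "reading", "watching", "practicing"]

def policy_sample(state):
    # Policy pi(a_t | s_t): an LLM would produce a distribution over tokens
    # given the state; here we draw uniformly as a stand-in.
    return random.choice(VOCAB)

def reward_model(state, action):
    # R(s_t, a_t): a learned scalar score for the state-action pair
    # (toy rule here in place of a trained reward model).
    return 1.0 if action == "doing" else 0.0

def value_fn(state):
    # V(s_t): expected return from the state under the current policy.
    # Uniform policy over 4 tokens, one of which earns reward 1.0.
    return 0.25

def q_value_fn(state, action):
    # Q(s_t, a_t): expected return from taking `action` in `state`;
    # simplified here to the immediate reward, as if this were the last step.
    return reward_model(state, action)

# State s_t = (input x, partially generated sequence y_{<t}).
x = "How do people learn best?"
y_prefix = "The best way to learn is by"
state = (x, y_prefix)

# Action a_t = next token y_t, sampled from the policy.
action = policy_sample(state)

# Feedback signals used to optimize the policy.
reward = reward_model(state, action)
advantage = q_value_fn(state, action) - value_fn(state)  # A(s,a) = Q(s,a) - V(s)

print(f"a_t={action!r}  R={reward}  A={advantage:+.2f}")
```

In practice the reward model and value functions are learned networks, and the advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t) is the quantity policy-gradient methods such as PPO feed back into the policy update.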
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Bellman Equation
State-Value Function (V) Formula
An agent is in a state s and must choose between two actions: A and B. According to the agent's current policy, it chooses action A with a 70% probability and action B with a 30% probability. The expected total future reward for taking action A from state s is +20. The expected total future reward for taking action B from state s is -10. Based on this information, which of the following statements correctly describes the relationship between the value of being in state s and the values of taking each action?
An agent is learning to navigate a complex environment. Match each of the following questions the agent might have with the type of value function that would most directly provide the answer.
Action-Value Function Definition
Drone Navigation Decision Analysis
Advantage Function in Terms of Q-values and V-values
Learn After
A language model fine-tuned using feedback is in the middle of generating a response. For a single, specific token to be chosen and its quality assessed, several internal events must occur. Arrange the following events in the correct chronological order for one generation step.
An RLHF-tuned language model has generated the partial sentence: 'The best way to learn is by'. The model's policy is now considering 'doing' as the next token. Which statement best analyzes the interaction of the core components at this specific moment of generation?
Diagnosing Component Outputs in Text Generation