Learn Before
RLHF Component Interaction during Token Generation
In the Reinforcement Learning from Human Feedback (RLHF) process, several components interact at each step of text generation. The input x together with the partially generated sequence y_{<t} forms the current state s_t. The policy, typically a Large Language Model (LLM), takes this state and produces an action a_t: the next token y_t. This state-action pair is then evaluated by a reward model R(s_t, a_t) and by the value functions V(s_t) and Q(s_t, a_t), and the resulting feedback is used to optimize the policy.
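A minimal sketch of one such generation step, assuming toy stand-in components: the names `policy_sample`, `reward_model`, `value_fn`, and `q_value_fn` are hypothetical placeholders, not the API of any real RLHF library.

```python
import random

# Toy vocabulary; a real policy would score the model's full token vocabulary.
VOCAB = ["doing", "reading", "watching", "practicing"]

def policy_sample(state):
    # Policy pi(a_t | s_t): an LLM would produce a distribution over tokens
    # given the state; here we draw uniformly as a stand-in.
    return random.choice(VOCAB)

def reward_model(state, action):
    # R(s_t, a_t): a learned scalar score for the state-action pair
    # (toy rule here in place of a trained reward model).
    return 1.0 if action == "doing" else 0.0

def value_fn(state):
    # V(s_t): expected return from the state under the current policy.
    # Uniform policy over 4 tokens, one of which earns reward 1.0.
    return 0.25

def q_value_fn(state, action):
    # Q(s_t, a_t): expected return from taking `action` in `state`;
    # simplified here to the immediate reward, as if this were the last step.
    return reward_model(state, action)

# State s_t = (input x, partially generated sequence y_{<t}).
x = "How do people learn best?"
y_prefix = "The best way to learn is by"
state = (x, y_prefix)

# Action a_t = next token y_t, sampled from the policy.
action = policy_sample(state)

# Feedback signals used to optimize the policy.
reward = reward_model(state, action)
advantage = q_value_fn(state, action) - value_fn(state)  # A(s,a) = Q(s,a) - V(s)

print(f"a_t={action!r}  R={reward}  A={advantage:+.2f}")
```

In practice the reward model and value functions are learned networks, and the advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t) is the quantity policy-gradient methods such as PPO feed back into the policy update.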
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Bellman Equation
State-Value Function (V) Formula
An agent is in a state s and must choose between two actions: A and B. According to the agent's current policy, it chooses action A with a 70% probability and action B with a 30% probability. The expected total future reward for taking action A from state s is +20. The expected total future reward for taking action B from state s is -10. Based on this information, which of the following statements correctly describes the relationship between the value of being in state s and the values of taking each action?
An agent is learning to navigate a complex environment. Match each of the following questions the agent might have with the type of value function that would most directly provide the answer.
Action-Value Function Definition
Drone Navigation Decision Analysis
Advantage Function in Terms of Q-values and V-values
Learn After
A language model fine-tuned using feedback is in the middle of generating a response. For a single, specific token to be chosen and its quality assessed, several internal events must occur. Arrange the following events in the correct chronological order for one generation step.
An RLHF-tuned language model has generated the partial sentence: 'The best way to learn is by'. The model's policy is now considering 'doing' as the next token. Which statement best analyzes the interaction of the core components at this specific moment of generation?
Diagnosing Component Outputs in Text Generation