Activity (Process)

RLHF Component Interaction during Token Generation

In the Reinforcement Learning from Human Feedback (RLHF) process, several components interact at each step of text generation. Given an input x and a partially generated sequence y_{<t}, the pair (x, y_{<t}) forms the current state s_t. The policy, typically a Large Language Model (LLM), takes this state and produces an action a_t, namely the next token y_t. The resulting state-action pair is then evaluated by a reward model R(s_t, a_t) and by the value functions V(s_t) and Q(s_t, a_t), and this feedback is used to optimize the policy.
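The sketch below illustrates how these pieces fit together at a single generation step. It is a minimal toy example, not a real RLHF implementation: ToyPolicy, ToyRewardModel, ToyValueHead, and the token ids are all assumed placeholders standing in for an actual LLM policy, learned reward model, and value estimator.

```python
# Minimal sketch of one RLHF generation step: state -> action -> reward/value feedback.
import torch
import torch.nn as nn

VOCAB_SIZE = 100   # assumed toy vocabulary size
HIDDEN = 32        # assumed hidden width


class ToyPolicy(nn.Module):
    """Stand-in for the LLM policy: maps the state s_t to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, state_tokens):
        h = self.embed(state_tokens).mean(dim=0)  # crude pooled representation of s_t
        return self.head(h)                       # logits over the next token a_t = y_t


class ToyRewardModel(nn.Module):
    """Stand-in for R(s_t, a_t): scores a state-action pair with a scalar reward."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.score = nn.Linear(2 * HIDDEN, 1)

    def forward(self, state_tokens, action_token):
        s = self.embed(state_tokens).mean(dim=0)
        a = self.embed(action_token).squeeze(0)
        return self.score(torch.cat([s, a]))      # scalar reward


class ToyValueHead(nn.Module):
    """Stand-in for V(s_t): estimates the expected return from the current state."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.value = nn.Linear(HIDDEN, 1)

    def forward(self, state_tokens):
        return self.value(self.embed(state_tokens).mean(dim=0))  # scalar V(s_t)


# One generation step: the state s_t = (x, y_{<t}) as a flat token sequence.
x_tokens = torch.tensor([5, 17, 42])      # prompt x (made-up token ids)
y_prefix = torch.tensor([7, 3])           # partially generated y_{<t}
state = torch.cat([x_tokens, y_prefix])   # s_t

policy, reward_model, value_head = ToyPolicy(), ToyRewardModel(), ToyValueHead()

logits = policy(state)
action = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)  # a_t = y_t

reward = reward_model(state, action)      # R(s_t, a_t)
value = value_head(state)                 # V(s_t)
# Q(s_t, a_t) can be estimated as the reward plus the discounted value of the next state;
# these signals are the feedback used to optimize the policy (e.g., with a policy-gradient
# method such as PPO).
q_estimate = reward + 0.99 * value_head(torch.cat([state, action]))

print(f"a_t={action.item()}, R={reward.item():.3f}, "
      f"V={value.item():.3f}, Q~={q_estimate.item():.3f}")
```

In a real system the pooled-embedding networks would be replaced by the full LLM (policy) and learned reward/value models, but the data flow per step is the same: build s_t from x and y_{<t}, sample a_t from the policy, then score the pair to produce the optimization signal.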

Updated 2025-10-10

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences