Activity (Process)

Reinforcement Learning Process for LLMs

In a reinforcement learning framework, training a Large Language Model (LLM) iteratively evaluates and improves the model's policy. At each step $t$, the current state $s_t$ is defined by the initial input prompt $\mathbf{x}$ and the tokens generated so far, $\mathbf{y}_{<t}$. The LLM acts as the policy, given by the predicted distribution $\Pr(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$, and chooses an action $a_t$: the next token $y_t$. After $y_t$ is predicted, a reward model evaluates the sequence $(\mathbf{x}, \mathbf{y}_{<t}, y_t)$ to determine how well it aligns with the desired textual outcome. This evaluation produces reward scores that are then used to compute the value functions $V(s_t)$ and $Q(s_t, a_t)$. Finally, these value functions provide the feedback that guides subsequent training and refinement of the LLM's policy.
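The single RL step described above can be sketched in code. This is a minimal toy illustration, not a real training loop: the tiny vocabulary, the fixed `policy` logits, and the hand-written `reward_model` are all hypothetical stand-ins for an actual LLM forward pass and a learned reward model, and $V(s_{t+1})$ is simply set to zero.

```python
import math
import random

# Hypothetical toy vocabulary standing in for an LLM tokenizer's vocabulary.
VOCAB = ["good", "bad", "<eos>"]

def policy(prompt, generated):
    """Toy stand-in for the LLM policy Pr(y_t | x, y_<t).

    A real LLM would run a forward pass on (prompt, generated) here;
    we use fixed logits purely for illustration.
    """
    logits = {"good": 2.0, "bad": 0.5, "<eos>": 1.0}
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

def reward_model(prompt, generated, token):
    """Toy reward model: scores how well (x, y_<t, y_t) matches the desired outcome."""
    return 1.0 if token == "good" else 0.0

def rl_step(prompt, generated, gamma=0.99):
    # State s_t = (x, y_<t); the policy gives the action distribution.
    probs = policy(prompt, generated)
    # Action a_t = y_t, sampled from the policy.
    token = random.choices(list(probs), weights=list(probs.values()))[0]
    # Reward for the chosen token.
    r = reward_model(prompt, generated, token)
    # One-step value estimates; V(s_{t+1}) is a placeholder (0 here).
    v_next = 0.0
    q = r + gamma * v_next  # Q(s_t, a_t)
    # V(s_t) = E_{a ~ policy}[Q(s_t, a)]
    v = sum(p * (reward_model(prompt, generated, tok) + gamma * v_next)
            for tok, p in probs.items())
    return token, q, v

random.seed(0)
tok, q, v = rl_step("Is this review positive?", [])
```

In a real system, `q` and `v` would feed back into a policy-gradient update of the LLM's weights; here they are only computed to make the state/action/reward/value roles concrete.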

Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences