Activity (Process)

Reinforcement Learning Process for LLMs

In a reinforcement learning framework, training a Large Language Model (LLM) iteratively evaluates and improves the model's policy. At each step $t$, the current state $s_t$ is defined by the initial input prompt $\mathbf{x}$ and the tokens generated so far, $\mathbf{y}_{<t}$. The LLM acts as the policy, given by the predicted distribution $\Pr(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$, and chooses an action $a_t$: the next token $y_t$. After $y_t$ is predicted, a reward model evaluates the sequence $(\mathbf{x}, \mathbf{y}_{<t}, y_t)$ to determine how well it aligns with the desired textual outcome. This evaluation produces reward scores that are then used to compute the value functions $V(s_t)$ and $Q(s_t, a_t)$. Finally, these value functions provide the feedback that guides subsequent training and refinement of the LLM's policy.
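The single RL step described above can be sketched in code. This is a minimal toy illustration, not a real training loop: the tiny vocabulary, the fixed `policy` logits, and the hand-written `reward_model` are all hypothetical stand-ins for an actual LLM forward pass and a learned reward model, and $V(s_{t+1})$ is simply set to zero.

```python
import math
import random

# Hypothetical toy vocabulary standing in for an LLM tokenizer's vocabulary.
VOCAB = ["good", "bad", "<eos>"]

def policy(prompt, generated):
    """Toy stand-in for the LLM policy Pr(y_t | x, y_<t).

    A real LLM would run a forward pass on (prompt, generated) here;
    we use fixed logits purely for illustration.
    """
    logits = {"good": 2.0, "bad": 0.5, "<eos>": 1.0}
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

def reward_model(prompt, generated, token):
    """Toy reward model: scores how well (x, y_<t, y_t) matches the desired outcome."""
    return 1.0 if token == "good" else 0.0

def rl_step(prompt, generated, gamma=0.99):
    # State s_t = (x, y_<t); the policy gives the action distribution.
    probs = policy(prompt, generated)
    # Action a_t = y_t, sampled from the policy.
    token = random.choices(list(probs), weights=list(probs.values()))[0]
    # Reward for the chosen token.
    r = reward_model(prompt, generated, token)
    # One-step value estimates; V(s_{t+1}) is a placeholder (0 here).
    v_next = 0.0
    q = r + gamma * v_next  # Q(s_t, a_t)
    # V(s_t) = E_{a ~ policy}[Q(s_t, a)]
    v = sum(p * (reward_model(prompt, generated, tok) + gamma * v_next)
            for tok, p in probs.items())
    return token, q, v

random.seed(0)
tok, q, v = rl_step("Is this review positive?", [])
```

In a real system, `q` and `v` would feed back into a policy-gradient update of the LLM's weights; here they are only computed to make the state/action/reward/value roles concrete.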

Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences