1Cademy - Policy Formula for LLMs in Reinforcement Learning

Learn Before

Policy in the Context of LLMs
Action in the Context of LLMs

Formula

Policy Formula for LLMs in Reinforcement Learning

The policy of a Large Language Model is formally expressed as a conditional probability. The probability of taking an action $a$ in a state $s$, denoted $\pi(a|s)$, is the probability of generating the next token $y_t$ given the input prompt $\mathbf{x}$ and the sequence of previously generated tokens $\mathbf{y}_{<t}$. The formula is: $\pi(a|s) = \Pr(y_t | \mathbf{x}, \mathbf{y}_{<t})$. In this formulation, the action $a$ corresponds to the predicted token $y_t$, and the state $s$ corresponds to the context sequence $(\mathbf{x}, \mathbf{y}_{<t})$.
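The mapping between the formula and a model's output can be sketched in a few lines of Python. This is a minimal illustration, not an actual LLM: the vocabulary `VOCAB` and the scoring function `toy_logits` are hypothetical stand-ins for a real tokenizer and a real model forward pass. The point is only that the policy is a softmax distribution over next tokens, conditioned on the state $(\mathbf{x}, \mathbf{y}_{<t})$.

```python
import math

# Hypothetical four-token vocabulary (a real model has tens of thousands).
VOCAB = [" red", " blue", " violets", " sweet"]


def toy_logits(prompt, generated):
    """Hypothetical stand-in for a model forward pass.

    Returns one unnormalized score per vocabulary token. The scores
    depend only on the context (x, y_<t), which is all that matters here.
    """
    context = prompt + generated
    return [((len(context) * (i + 3)) % 7) / 2.0 for i in range(len(VOCAB))]


def policy(prompt, generated):
    """pi(a | s) = Pr(y_t | x, y_<t), computed as a softmax over logits."""
    logits = toy_logits(prompt, generated)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {token: e / total for token, e in zip(VOCAB, exps)}


# State s: the prompt x plus the tokens generated so far, y_<t.
prompt = ["Roses", " are"]
generated = [" red"]

# The policy assigns a probability to every possible action (next token).
dist = policy(prompt, generated)
for token, prob in dist.items():
    print(f"pi({token!r} | s) = {prob:.3f}")

# Taking an action means choosing y_t; here the most probable token is
# picked. The chosen token is appended to y_<t, producing the next state.
action = max(dist, key=dist.get)
next_generated = generated + [action]
next_dist = policy(prompt, next_generated)
```

Because the state includes every token generated so far, each action changes the state, and the next call to `policy` conditions on a longer context.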

Updated 2026-05-02

References

Reference of Foundations of Large Language Models Course

Learn After

An autoregressive language model is given the input prompt 'The weather today is' and has so far generated the token ' exceptionally'. The model is now deciding on the very next token to produce. In a reinforcement learning context where the model's policy is defined as the probability of taking an action 'a' in a state 's', which of the following correctly identifies the state and action for this specific decision-making step?
Dynamic State in LLM Policy
In the context of applying reinforcement learning to a language model, the model's strategy is defined by the policy formula $\pi(a|s) = \Pr(y_t | \mathbf{x}, \mathbf{y}_{<t})$. Match each component of this formulation to its correct description.
