Formula

Policy Formula for LLMs in Reinforcement Learning

The policy of a large language model (LLM) is formally expressed as a conditional probability. Specifically, the policy of taking an action $a$ in a state $s$, denoted $\pi(a|s)$, is the probability of generating the next token $y_t$ given the input prompt $\mathbf{x}$ and the sequence of previously generated tokens $\mathbf{y}_{<t}$. The formula is: $\pi(a|s) = \Pr(y_t | \mathbf{x}, \mathbf{y}_{<t})$. In this formulation, the action $a$ corresponds to the predicted token $y_t$, and the state $s$ corresponds to the context sequence $(\mathbf{x}, \mathbf{y}_{<t})$.
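The mapping from state to action distribution can be made concrete with a small sketch. Everything named here is hypothetical and chosen only for illustration: the four-token `VOCAB`, the `toy_logits` function standing in for a real model's forward pass, and the helper names `policy` and `sequence_log_prob`. The sketch assumes the usual setup in which the model produces one score (logit) per vocabulary token and a softmax turns those scores into $\Pr(y_t | \mathbf{x}, \mathbf{y}_{<t})$.

```python
import math

# Hypothetical toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "<eos>"]


def toy_logits(prompt, generated):
    """Stand-in for an LLM forward pass (an assumption, not a real model).

    Returns one score per vocabulary token, computed deterministically
    from the length of the context (x, y_<t).
    """
    context_len = len(prompt) + len(generated)
    return [math.sin(context_len + i) for i in range(len(VOCAB))]


def policy(prompt, generated):
    """Return pi(. | s) for the state s = (x, y_<t).

    The result maps each candidate action (next token y_t) to its
    probability, obtained by a numerically stable softmax over the logits.
    """
    logits = toy_logits(prompt, generated)
    max_logit = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return {token: e / total for token, e in zip(VOCAB, exps)}


def sequence_log_prob(prompt, response):
    """Sum log pi(a_t | s_t) over a response, one token at a time.

    At step t the state is (prompt, response[:t]) and the action is
    response[t], mirroring pi(a|s) = Pr(y_t | x, y_<t).
    """
    log_prob = 0.0
    for t, token in enumerate(response):
        log_prob += math.log(policy(prompt, response[:t])[token])
    return log_prob


if __name__ == "__main__":
    x = ["tell", "a", "story"]   # input prompt x
    y_prev = ["the", "cat"]      # previously generated tokens y_<t
    print(policy(x, y_prev))     # distribution over the next token y_t
    print(sequence_log_prob(x, ["the", "cat", "sat"]))
```

Calling `policy(x, y_prev)` plays the role of evaluating $\pi(\cdot|s)$ at one state, and picking a token from the returned distribution corresponds to taking an action. In practice the logits would come from the language model itself rather than from a fixed function.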

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences