Definition

Language Model as a Stochastic Policy

When applying reinforcement learning to sequence generation tasks, the language model itself is treated as the policy. The policy, denoted π_θ, defines the probability of choosing the next token y_t given the input X and the previously generated tokens y_{<t}. This policy is directly equivalent to the conditional probability distribution of the language model, Pr_θ. The relationship is formally stated as:

$$\pi_{\theta}(y_t \mid \mathbf{X}, \mathbf{y}_{<t}) = \text{Pr}_{\theta}(y_t \mid \mathbf{X}, \mathbf{y}_{<t})$$
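The equivalence above can be sketched in code. In this minimal toy example (a stand-in scoring function, not a real trained model; the names `logits`, `policy`, and `sample_action` are illustrative assumptions), the "policy" is nothing more than the model's conditional next-token distribution: a softmax over the model's scores for each vocabulary token given the context.

```python
import math
import random

# Toy vocabulary; a real LM would have tens of thousands of tokens.
VOCAB = ["<eos>", "the", "cat", "sat"]

def logits(context):
    # Hypothetical stand-in for a trained model's forward pass: assigns a
    # score to each vocabulary token given the context token ids.
    return [0.1 * (i + 1) * (len(context) + 1) for i in range(len(VOCAB))]

def policy(context):
    """pi_theta(. | X, y_<t) = Pr_theta(. | X, y_<t): softmax over logits."""
    z = logits(context)
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def sample_action(context, rng):
    # Acting under the policy = sampling the next token from the
    # model's conditional distribution.
    probs = policy(context)
    return rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
```

Because the policy and the language model's conditional distribution are the same object, sampling a generation from the model is exactly a rollout of the policy, one token (action) at a time.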


Updated 2025-10-07


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences