Language Model as a Stochastic Policy
When applying reinforcement learning to sequence generation tasks, the language model itself is treated as the policy. The policy, denoted as π_θ, defines the probability of choosing the next token y_t given the input X and the previously generated tokens y_<t. This policy is directly equivalent to the conditional probability distribution of the language model, Pr_θ. The relationship is formally stated as:

π_θ(y_t | y_<t, X) = Pr_θ(y_t | X, y_<t)
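As a minimal sketch of this equivalence (the vocabulary, contexts, and probabilities below are invented purely for illustration, not taken from the book), the policy is nothing more than the model's next-token distribution, and taking an action means sampling a token from it:

```python
import random

# Toy stand-in for the language model's conditional distribution
# Pr_theta(y_t | X, y_<t). The contexts and probabilities are invented
# purely for illustration.
def lm_next_token_probs(x, y_prefix):
    if not y_prefix:
        return {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1}
    return {".": 0.7, "<eos>": 0.3}

# Viewed as an RL policy, pi_theta(y_t | y_<t, X) is the very same
# distribution: the "state" is (X, y_<t) and the "actions" are tokens.
def policy(x, y_prefix):
    return lm_next_token_probs(x, y_prefix)

# One policy step = sampling the next token (the action) from pi_theta.
def act(x, y_prefix):
    probs = policy(x, y_prefix)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

x = ["The", "weather", "today", "is"]
y = []
while len(y) < 10:
    token = act(x, y)
    if token == "<eos>":
        break
    y.append(token)
print(y)
```

Generating a sequence is therefore a sequence of policy steps, one token per action, which is what lets the standard policy-gradient machinery be applied to the language model directly.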
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Plackett-Luce Loss Function
A model is being trained by maximizing the sum of log-probabilities for a dataset of 1,000 examples. Consider two scenarios for a single training update:
Scenario A: The probability assigned to the correct output for one example improves from 0.1 to 0.2. The probabilities for all other 999 examples remain unchanged.
Scenario B: The probability assigned to the correct output for one example improves from 0.8 to 0.9. The probabilities for all other 999 examples remain unchanged.
Which scenario leads to a larger increase in the overall training objective function, and why?
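A quick numerical check of the two scenarios (a minimal sketch that uses only the probabilities stated in the question above):

```python
import math

# Change in the objective (a sum of log-probabilities) when a single
# example's probability moves while the other 999 terms stay fixed.
delta_a = math.log(0.2) - math.log(0.1)  # Scenario A: 0.1 -> 0.2
delta_b = math.log(0.9) - math.log(0.8)  # Scenario B: 0.8 -> 0.9

print(f"Scenario A increase: {delta_a:.3f}")  # ~0.693 (= ln 2)
print(f"Scenario B increase: {delta_b:.3f}")  # ~0.118
```

Because the objective is a sum of log terms, only the changed example contributes, so the comparison reduces to the two log-ratios.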
Model Comparison using Conditional Log-Likelihood
Evaluating a Training Update
Learn After
Policy Gradient Utility for Sequence Generation
A language model is tasked with generating a sentence. After producing the partial sequence 'The cat sat on the', it computes the following probability distribution for the next word: {'mat': 0.7, 'chair': 0.2, 'roof': 0.1}. If we frame this generation process using reinforcement learning, how is this probability distribution correctly interpreted?
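As an illustrative sketch of that framing (the distribution is the one given in the question), each candidate word is an action and sampling the next word is one step of the stochastic policy:

```python
import random

# pi_theta(. | 'The cat sat on the', X): action probabilities over next
# tokens, taken directly from the question above.
action_probs = {"mat": 0.7, "chair": 0.2, "roof": 0.1}

# Taking an action under the policy = sampling the next word.
next_word = random.choices(list(action_probs),
                           weights=list(action_probs.values()), k=1)[0]
print(next_word)
```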
Equivalence of Language Model and Policy
Conceptual Error in RL Fine-Tuning