Language Model as a Stochastic Policy
When applying reinforcement learning to sequence generation tasks, the language model itself is treated as the policy. The policy, denoted as π_θ, defines the probability of choosing the next token y_t given the input X and the previously generated tokens y_<t. This policy is directly equivalent to the conditional probability distribution of the language model, Pr_θ. The relationship is formally stated as:

π_θ(y_t | y_<t, X) = Pr_θ(y_t | X, y_<t)
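As a minimal sketch of this equivalence (the vocabulary, contexts, and probabilities below are invented purely for illustration, not taken from the book), the policy is nothing more than the model's next-token distribution, and taking an action means sampling a token from it:

```python
import random

# Toy stand-in for the language model's conditional distribution
# Pr_theta(y_t | X, y_<t). The contexts and probabilities are invented
# purely for illustration.
def lm_next_token_probs(x, y_prefix):
    if not y_prefix:
        return {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1}
    return {".": 0.7, "<eos>": 0.3}

# Viewed as an RL policy, pi_theta(y_t | y_<t, X) is the very same
# distribution: the "state" is (X, y_<t) and the "actions" are tokens.
def policy(x, y_prefix):
    return lm_next_token_probs(x, y_prefix)

# One policy step = sampling the next token (the action) from pi_theta.
def act(x, y_prefix):
    probs = policy(x, y_prefix)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

x = ["The", "weather", "today", "is"]
y = []
while len(y) < 10:
    token = act(x, y)
    if token == "<eos>":
        break
    y.append(token)
print(y)
```

Generating a sequence is therefore a sequence of policy steps, one token per action, which is what lets the standard policy-gradient machinery be applied to the language model directly.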
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Plackett-Luce Loss Function
A model is being trained by maximizing the sum of log-probabilities for a dataset of 1,000 examples. Consider two scenarios for a single training update:
Scenario A: The probability assigned to the correct output for one example improves from 0.1 to 0.2. The probabilities for all other 999 examples remain unchanged.
Scenario B: The probability assigned to the correct output for one example improves from 0.8 to 0.9. The probabilities for all other 999 examples remain unchanged.
Which scenario leads to a larger increase in the overall training objective function, and why?
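A quick numerical check of the two scenarios (a minimal sketch that uses only the probabilities stated in the question above):

```python
import math

# Change in the objective (a sum of log-probabilities) when a single
# example's probability moves while the other 999 terms stay fixed.
delta_a = math.log(0.2) - math.log(0.1)  # Scenario A: 0.1 -> 0.2
delta_b = math.log(0.9) - math.log(0.8)  # Scenario B: 0.8 -> 0.9

print(f"Scenario A increase: {delta_a:.3f}")  # ~0.693 (= ln 2)
print(f"Scenario B increase: {delta_b:.3f}")  # ~0.118
```

Because the objective is a sum of log terms, only the changed example contributes, so the comparison reduces to the two log-ratios.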
Model Comparison using Conditional Log-Likelihood
Evaluating a Training Update
Learn After
Policy Gradient Utility for Sequence Generation
A language model is tasked with generating a sentence. After producing the partial sequence 'The cat sat on the', it computes the following probability distribution for the next word: {'mat': 0.7, 'chair': 0.2, 'roof': 0.1}. If we frame this generation process using reinforcement learning, how is this probability distribution correctly interpreted?
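As an illustrative sketch of that framing (the distribution is the one given in the question), each candidate word is an action and sampling the next word is one step of the stochastic policy:

```python
import random

# pi_theta(. | 'The cat sat on the', X): action probabilities over next
# tokens, taken directly from the question above.
action_probs = {"mat": 0.7, "chair": 0.2, "roof": 0.1}

# Taking an action under the policy = sampling the next word.
next_word = random.choices(list(action_probs),
                           weights=list(action_probs.values()), k=1)[0]
print(next_word)
```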
Equivalence of Language Model and Policy
Conceptual Error in RL Fine-Tuning