Policy Formula for LLMs in Reinforcement Learning
The policy for a Large Language Model is formally expressed as a conditional probability. Specifically, the policy of taking an action $a$ in a state $s$, denoted as $\pi(a \mid s)$, is the probability of generating the next token $y_t$ given the input prompt $\mathbf{x}$ and the sequence of previously generated tokens $\mathbf{y}_{<t}$. The formula is: $\pi(a \mid s) = \Pr(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$. In this formulation, the action $a$ corresponds to the predicted token $y_t$, and the state $s$ corresponds to the context sequence $(\mathbf{x}, \mathbf{y}_{<t})$.
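The mapping from state to action can be illustrated with a minimal Python sketch. It assumes a toy three-token vocabulary and made-up logits (the names `policy`, `toy_logits`, and the numeric values are hypothetical, not taken from any real model); a real LLM would produce the logits from the state with its network, over a vocabulary of many thousands of tokens.

```python
import math

def policy(logits):
    """Softmax over next-token logits: returns pi(a | s) for every candidate token."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

# State s = (x, y_<t): the input prompt plus the tokens generated so far.
prompt = "The weather today is"
generated = (" exceptionally",)
state = (prompt, generated)

# Hypothetical logits standing in for the model's output at this state.
toy_logits = {" warm": 2.0, " cold": 1.0, " purple": -1.0}

pi = policy(toy_logits)          # distribution over actions, i.e. candidate tokens y_t
action = max(pi, key=pi.get)     # greedy choice of the next token

# After the action is taken, the chosen token joins the context,
# so the state for the next step is (x, y_<t+1).
next_state = (prompt, generated + (action,))
```

The sketch picks the most probable token greedily; sampling from `pi` instead would be an equally valid way to take an action under the same policy. In either case the state changes after every step, because each chosen token is appended to the context.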

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Formula for LLMs in Reinforcement Learning
An autoregressive language model has processed the input 'The cat sat on the' and is now deciding the next word to generate. At this specific step, which of the following best describes the model's 'policy'?
Analyzing Language Model Generation Strategies
Nature of an LLM's Policy
Policy Formula for LLMs in Reinforcement Learning
A language model is generating a response to the prompt 'The best way to learn a new skill is to...'. So far, it has produced the sequence 'The best way to learn a new skill is to practice'. At this exact point in the generation process, what constitutes the model's next 'action' within a reinforcement learning framework?
Comparing 'Action' in Different Reinforcement Learning Scenarios
Identifying the Action in LLM Fine-Tuning
Learn After
An autoregressive language model is given the input prompt 'The weather today is' and has so far generated the token ' exceptionally'. The model is now deciding on the very next token to produce. In a reinforcement learning context where the model's policy is defined as the probability of taking an action 'a' in a state 's', which of the following correctly identifies the state and action for this specific decision-making step?
Dynamic State in LLM Policy
In the context of applying reinforcement learning to a language model, the model's strategy is defined by the policy formula $\pi(a \mid s) = \Pr(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$. Match each component of this formulation to its correct description.