Learn Before
A research team is training a language model to act as a helpful assistant using methods from reinforcement learning. One researcher is focused on analyzing the model's 'policy' (π) for generating a response given a user's query. Another researcher is analyzing the model's 'conditional probability distribution' (Pr) over all possible responses for the same query. What is the relationship between the 'policy' and the 'conditional probability distribution' in this context?
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Modifying a Chatbot's Behavior
When applying reinforcement learning to a language model, the model's policy, denoted as π(y|x), is a separate computational function that is trained to approximate the model's core conditional probability distribution, Pr(y|x).