Learn Before
Modifying a Chatbot's Behavior
A team is fine-tuning a language model to act as a safer customer service chatbot. They adjust the model's parameters to decrease the probability of it generating responses that promise specific, unverified delivery dates. In the context of reinforcement learning where the model is an 'agent', what is being directly modified? Explain the relationship between the change in output probabilities and the agent's 'policy'.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A research team is training a language model to act as a helpful assistant using methods from reinforcement learning. One researcher is focused on analyzing the model's 'policy' (π) for generating a response given a user's query. Another researcher is analyzing the model's 'conditional probability distribution' (Pr) over all possible responses for the same query. What is the relationship between the 'policy' and the 'conditional probability distribution' in this context?
Modifying a Chatbot's Behavior
When applying reinforcement learning to a language model, the model's policy, denoted π(y|x), is not a separate computational function: it is the model's conditional probability distribution over responses itself, i.e., π(y|x) = Pr(y|x). Adjusting the model's parameters therefore changes the output probabilities and the policy simultaneously, because they are the same object.
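The identity between the policy and the conditional distribution can be illustrated with a toy sketch. The three candidate responses and their logits below are made up purely for illustration; the point is that nudging a parameter (here, one logit) directly reshapes the distribution π(y|x) = Pr(y|x), which is exactly what fine-tuning against unverified delivery promises does at scale.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical action space: three responses to one customer query x.
responses = [
    "It will arrive on Tuesday.",          # unverified delivery promise
    "Let me check your order status.",
    "Shipping times vary by region.",
]
logits = [2.0, 1.5, 1.0]                   # illustrative parameters

# The policy IS the conditional distribution: pi(y|x) = Pr(y|x).
policy_before = softmax(logits)

# Mimic the effect of fine-tuning: lower the score of the risky response.
logits[0] -= 3.0
policy_after = softmax(logits)

# The probability of promising a delivery date drops; the policy changed.
print(f"before: {policy_before[0]:.3f}  after: {policy_after[0]:.3f}")
```

Running the sketch shows the first response's probability falling while the distribution still sums to one, i.e., one and the same object serves as both the output distribution and the agent's policy.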