1Cademy - When applying reinforcement learning to a language model, the model's policy, denoted as π(y|x), is a separate computational function that is trained to approximate the model's core conditional probability distribution, Pr(y|x).

Learn Before

LLM Policy as a Probability Distribution

True/False

When applying reinforcement learning to a language model, the model's policy, denoted as π(y|x), is a separate computational function that is trained to approximate the model's core conditional probability distribution, Pr(y|x).

Updated 2025-10-07

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Ch.4 Alignment - Foundations of Large Language Models

Comprehension in Revised Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

A research team is training a language model to act as a helpful assistant using methods from reinforcement learning. One researcher is focused on analyzing the model's 'policy' (π) for generating a response given a user's query. Another researcher is analyzing the model's 'conditional probability distribution' (Pr) over all possible responses for the same query. What is the relationship between the 'policy' and the 'conditional probability distribution' in this context?
Modifying a Chatbot's Behavior
Policy Notation for Autoregressive Models ($\pi_\theta$)
