LLM Policy as a Probability Distribution
In the context of reinforcement learning, the policy of a Large Language Model agent is the model's probability distribution over possible outputs. This policy, often denoted by π, is equivalent to the conditional probability of generating an output sequence 'y' given an input context 'x'. This relationship is expressed as π(y|x) = Pr(y|x).
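The identity π(y|x) = Pr(y|x) can be made concrete with a small sketch. The "model" below is a hypothetical toy lookup table of per-token conditionals (not a real LLM); the policy's probability for a whole output sequence is the product of those conditionals, one factor per generated token.

```python
# Minimal sketch: a policy as a conditional probability distribution.
# TOY_MODEL is a hypothetical stand-in for a language model's per-token
# conditionals Pr(token | context); pi(y | x) is their product.

TOY_MODEL = {
    "The cat": {"sat": 0.6, "ran": 0.4},
    "The cat sat": {"down": 0.7, "up": 0.3},
}

def policy_prob(context: str, tokens: list[str]) -> float:
    """pi(y | x): probability the policy assigns to token sequence y
    given context x, computed as a product of conditionals."""
    prob = 1.0
    for tok in tokens:
        prob *= TOY_MODEL[context].get(tok, 0.0)
        context = f"{context} {tok}"  # extend the context with each token
    return prob

# pi("sat down" | "The cat") = 0.6 * 0.7 = 0.42
print(policy_prob("The cat", ["sat", "down"]))
```

The key point is that the policy is not a separate object layered on top of the model: querying π(y|x) is the same computation as evaluating the model's conditional distribution.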
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Fundamental LLM Training Objective
LLM Policy as a Probability Distribution
A language model is given the context: 'The chef carefully added the final, crucial ingredient to the simmering stew: a pinch of...'. The model must predict the next word. Below are the conditional probabilities, Pr(next_word | context), calculated by two different models for four possible next words.

Next Word    Model A Probability    Model B Probability
salt         0.65                   0.20
concrete     0.02                   0.45
laughter     0.03                   0.15
thyme        0.30                   0.20

Based on this data, which of the following statements is the most accurate analysis of the models' understanding of the context?
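One way to analyze the two distributions from the card is to sum the probability mass each model assigns to contextually plausible next words. The choice of {salt, thyme} as the plausible set is an assumption for illustration, based on what a cook would add to a stew.

```python
# Compare the two next-word distributions from the card by summing the
# mass each model places on contextually plausible words (assumed set).

model_a = {"salt": 0.65, "concrete": 0.02, "laughter": 0.03, "thyme": 0.30}
model_b = {"salt": 0.20, "concrete": 0.45, "laughter": 0.15, "thyme": 0.20}

plausible = {"salt", "thyme"}  # assumption: ingredients a cook might add

def plausible_mass(dist: dict[str, float]) -> float:
    """Total probability the model assigns to the plausible set."""
    return sum(p for word, p in dist.items() if word in plausible)

print(f"Model A: {plausible_mass(model_a):.2f}")  # 0.65 + 0.30 = 0.95
print(f"Model B: {plausible_mass(model_b):.2f}")  # 0.20 + 0.20 = 0.40
```

Model A concentrates 0.95 of its mass on plausible ingredients, while Model B spreads most of its mass elsewhere and even ranks 'concrete' highest, suggesting a much weaker grasp of the context.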
Mathematical Notation for Text Generation Probability
Evaluating Language Model Suitability
Predicting Next-Word Likelihood
Loss Function for Language Modeling
Policy in the Context of LLMs
LLM Policy as a Probability Distribution
Identifying the Agent and Action in a Training Scenario
When a language model is fine-tuned using a system that incorporates human preferences, this process is often conceptualized within a reinforcement learning framework. Which of the following statements correctly analyzes the components of this interaction?
When training a language model using a framework that incorporates human feedback, standard reinforcement learning terminology is used. Match each reinforcement learning term on the left with its corresponding component or concept in this specific language model training context on the right.
Learn After
A research team is training a language model to act as a helpful assistant using methods from reinforcement learning. One researcher is focused on analyzing the model's 'policy' (π) for generating a response given a user's query. Another researcher is analyzing the model's 'conditional probability distribution' (Pr) over all possible responses for the same query. What is the relationship between the 'policy' and the 'conditional probability distribution' in this context?
Modifying a Chatbot's Behavior
When applying reinforcement learning to a language model, the model's policy, denoted as π(y|x), is a separate computational function that is trained to approximate the model's core conditional probability distribution, Pr(y|x).