1Cademy - LLM Policy as a Probability Distribution

Learn Before

Text Generation Probability
LLM as the Agent in RLHF

Definition

LLM Policy as a Probability Distribution

In the context of reinforcement learning, the policy of a Large Language Model agent is the model's probability distribution over possible outputs. This policy, often denoted by $\pi$, is equivalent to the conditional probability of generating an output sequence $\mathbf{y}$ given an input context $\mathbf{x}$. This relationship is expressed as $\pi(\mathbf{y}|\mathbf{x}) = \Pr(\mathbf{y}|\mathbf{x})$.
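
To make the definition concrete, the following minimal sketch evaluates $\pi(\mathbf{y}|\mathbf{x})$ for a toy policy. It assumes the usual autoregressive factorization, $\Pr(\mathbf{y}|\mathbf{x}) = \prod_t \Pr(y_t|\mathbf{x}, \mathbf{y}_{<t})$, which this card does not state explicitly. The vocabulary `VOCAB`, the function `next_token_probs`, and its hand-made logits are hypothetical stand-ins for a real language model, not part of any library.

```python
import math

# Hypothetical three-token vocabulary, used only for illustration.
VOCAB = ["a", "b", "<eos>"]


def next_token_probs(x, prefix):
    """Toy stand-in for Pr(y_t | x, y_<t).

    A real LLM would compute these logits with a neural network; here they
    are an arbitrary function of the context and prefix lengths (assumption).
    """
    logits = [
        (len(x) % 3) + 1.0,
        len(prefix) + 0.5,
        0.2 * (len(x) + len(prefix)),
    ]
    # Numerically stable softmax over the toy logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return {tok: e / z for tok, e in zip(VOCAB, exps)}


def policy_log_prob(x, y):
    """log pi(y | x), accumulated token by token via the chain rule."""
    total = 0.0
    for t, tok in enumerate(y):
        total += math.log(next_token_probs(x, y[:t])[tok])
    return total


def policy_prob(x, y):
    """pi(y | x) = Pr(y | x) for the toy policy."""
    return math.exp(policy_log_prob(x, y))


if __name__ == "__main__":
    x = ["hello", "world"]   # input context
    y = ["a", "<eos>"]       # one candidate output sequence
    print(policy_prob(x, y))
```

Because each per-token distribution sums to 1, the probabilities of all output sequences of a fixed length also sum to 1, which is what makes $\pi(\cdot|\mathbf{x})$ a valid distribution over outputs. Working in log space, as `policy_log_prob` does, is a common choice to avoid numerical underflow on long sequences.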