Formula

Policy Proportional to Exponentiated Reward

A policy, denoted π(y|x), can be modeled such that the probability of generating an output y given an input x is proportional to the exponential of a reward function r(x, y). This fundamental relationship is expressed as:

π(y|x) ∝ exp(r(x, y))

This formulation ensures that outputs with higher reward scores are assigned exponentially higher probabilities, forming the basis for converting learned rewards into a usable probability distribution, which is then normalized over all candidate outputs.
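The normalization step turns the proportionality into a softmax over reward scores. A minimal sketch in Python, using hypothetical reward values for three candidate outputs:

```python
import math

def policy_from_rewards(rewards):
    """Convert reward scores r(x, y) into a normalized policy pi(y|x)
    by exponentiating each reward and dividing by the sum (a softmax)."""
    exps = [math.exp(r) for r in rewards]
    z = sum(exps)  # partition function: normalizer over all candidate outputs
    return [e / z for e in exps]

# Hypothetical reward scores for three candidate outputs y given the same x
rewards = [1.0, 2.0, 0.5]
probs = policy_from_rewards(rewards)
# probs sums to 1, and the higher-reward output gets exponentially more mass
```

Note that in practice the set of candidate outputs y is enormous (all token sequences), so this normalizer is intractable to compute exactly; this sketch only illustrates the relationship on a small candidate set.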

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
