Policy Proportional to Exponentiated Reward
A policy, denoted as π(y | x), can be modeled such that the probability of generating an output y given an input x is proportional to the exponential of a reward function r(x, y). This fundamental relationship is expressed as:

π(y | x) ∝ exp(r(x, y))

This formulation ensures that outputs with higher reward scores are assigned exponentially higher probabilities, forming the basis for converting learned rewards into a usable probability distribution, which is then typically normalized.
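As a minimal sketch of how the proportionality becomes a normalized distribution, the exponentiated rewards can be divided by their sum (a softmax over a candidate set); the reward values below are hypothetical:

```python
import math

def policy_probs(rewards):
    """Turn reward scores into probabilities where
    P(y | x) is proportional to exp(r(x, y))."""
    worths = [math.exp(r) for r in rewards]  # exponentiate each reward
    total = sum(worths)                      # normalization constant
    return [w / total for w in worths]

# Hypothetical reward scores for three candidate outputs
probs = policy_probs([3.0, 1.0, 0.0])
```

Because exp is strictly increasing, the ordering of probabilities always matches the ordering of rewards, and the normalization makes the values sum to one.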
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Policy Proportional to Exponentiated Reward
A system for ranking text responses first assigns a numerical reward score to each response, and then calculates a 'worth' value for each response using the formula: worth = exp(reward score). Consider two scenarios:
Scenario 1: Response A has a reward score of 3.0, and Response B has a reward score of 1.0. Scenario 2: Response C has a reward score of 8.0, and Response D has a reward score of 6.0.
How does the ratio of worths (Worth_A / Worth_B) in Scenario 1 compare to the ratio of worths (Worth_C / Worth_D) in Scenario 2?
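A quick numeric check of the two scenarios, using the scores stated above:

```python
import math

# Scenario 1: Worth_A / Worth_B
ratio_1 = math.exp(3.0) / math.exp(1.0)
# Scenario 2: Worth_C / Worth_D
ratio_2 = math.exp(8.0) / math.exp(6.0)

# Since exp(a) / exp(b) = exp(a - b), both ratios equal exp(2.0):
# only the score difference (2.0 in both scenarios) matters.
```

This is the key property of the exponential worth function: ratios of worths depend only on differences in reward scores.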
A system for modeling human preferences assigns a numerical reward score, r, to a given text response. This score can be positive, negative, or zero. To use these scores in a specific type of ranking probability model, each score r must be converted into a 'worth' value α that is always positive and strictly increases as r increases. A researcher proposes using the function α = r² + 0.1 for this conversion. Which statement correctly analyzes the suitability of this proposed function?

A system models preferences by first assigning a numerical reward score to a response and then converting it to a 'worth' value using the formula: worth = exp(reward_score). An engineer improves a response, causing its reward score to increase first from 2.0 to 3.0, and then with a further improvement, from 3.0 to 4.0. How does the increase in the response's 'worth' value during the first improvement compare to the increase during the second improvement?
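Both conversion questions above can be probed numerically; this is a sketch using the formulas as stated:

```python
import math

# Proposed conversion alpha = r**2 + 0.1: always positive, but NOT
# strictly increasing in r -- e.g. raising r from -2.0 to -1.0 lowers alpha.
alpha = lambda r: r**2 + 0.1
decreasing_for_negative_r = alpha(-2.0) > alpha(-1.0)  # 4.1 > 1.1

# worth = exp(reward_score): equal score increments give growing worth
# increments, because exp is convex.
first_gain = math.exp(3.0) - math.exp(2.0)   # score 2.0 -> 3.0
second_gain = math.exp(4.0) - math.exp(3.0)  # score 3.0 -> 4.0
# second_gain / first_gain = e, so the second increase is e times larger.
```

The first check shows why r² + 0.1 fails the strict-monotonicity requirement for negative scores; the second shows that exp amplifies later improvements more than earlier ones of the same size.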
Learn After
Plackett-Luce Selection Probability Formula
Optimal Policy as a Product of Reference Policy and Exponentiated Reward
Worth Function in Plackett-Luce Model
A language model's policy, which determines the probability of generating an output y given an input x, is structured to be proportional to the exponential of a reward score r(x, y). For a specific input, two potential outputs have the following reward scores:

- Output A: Reward = 3.0
- Output B: Reward = 1.0
Based on this formulation, how does the probability of generating Output A compare to the probability of generating Output B?
Analyzing Language Model Response Probabilities
A language model's policy is designed such that the probability of generating an output is proportional to the exponential of its reward score. If Output Y has a reward score that is exactly double the reward score of Output Z, it means the policy will assign exactly double the probability to Output Y compared to Output Z.
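Under an exp-proportional policy, the probability ratio between two outputs is the exponential of the score difference, not the ratio of the scores themselves; a small sketch with hypothetical scores where Y's reward is double Z's:

```python
import math

r_y, r_z = 4.0, 2.0  # hypothetical: Y's score is exactly double Z's
# The normalization constant cancels in the ratio P(Y) / P(Z):
ratio = math.exp(r_y) / math.exp(r_z)
# ratio = exp(r_y - r_z) = exp(2.0), roughly 7.39 -- not 2.0
```

Doubling a reward score therefore does not double the probability; the probability ratio grows exponentially with the score gap.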