1Cademy - Deconstructing the Reinforcement Learning Loss Function

Learn Before

Basic A2C Formulation for LLMs

Short Answer

A common loss function used to update a language model's policy, $\pi_{\theta}$, is $\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\cdot|\mathbf{x})} [U(\mathbf{x}, \mathbf{y})]$, where $U(\mathbf{x}, \mathbf{y})$ is a function that assigns a high score to desirable outputs $\mathbf{y}$ for a given input $\mathbf{x}$. Analyze this formula and explain the specific purpose of two of its key components:

The negative sign ($-$) at the beginning of the expression.
The expectation ($\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\cdot|\mathbf{x})}$) taken over the model's output distribution.
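
To make the two components concrete, here is a minimal sketch on a toy problem. It assumes a policy over only three candidate outputs, parameterized by softmax logits, with hand-picked utility scores. The names `softmax`, `exact_loss`, and `sampled_loss` and all numeric values are illustrative assumptions, not part of the original formulation. The exact loss sums over every output, while the sampled loss approximates the expectation by drawing outputs from the policy itself.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into a probability distribution (the toy policy)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_loss(logits, utilities):
    """L(theta) = -E_{y ~ pi_theta}[U(x, y)], computed by summing over all outputs."""
    probs = softmax(logits)
    return -sum(p * u for p, u in zip(probs, utilities))


def sampled_loss(logits, utilities, num_samples, rng):
    """Monte Carlo estimate of the same loss: average U over sampled outputs."""
    probs = softmax(logits)
    outputs = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    return -sum(utilities[y] for y in outputs) / num_samples


# Hypothetical utilities for three candidate outputs (assumed values).
utilities = [1.0, 0.0, -1.0]

uniform_logits = [0.0, 0.0, 0.0]  # policy with no preference
better_logits = [2.0, 0.0, 0.0]   # policy that favors the high-utility output

loss_uniform = exact_loss(uniform_logits, utilities)
loss_better = exact_loss(better_logits, utilities)
estimate = sampled_loss(better_logits, utilities, 20000, random.Random(0))

print("loss (uniform policy):", loss_uniform)
print("loss (better policy): ", loss_better)
print("sampled estimate:     ", estimate)
```

In this sketch, the policy that places more probability on the high-utility output has the lower loss, so a minimizer that drives the loss down is driving expected utility up. The sampled estimate shows why the expectation is taken over the model's own outputs: in a real language model the output space is far too large to enumerate, so the average over samples stands in for the full sum.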

Updated 2025-10-02
