1Cademy - A language model's policy, $\pi_{\theta}$, is being updated by minimizing the loss function $\mathcal{L}(\theta) = -\mathbb{E}[U(\mathbf{x}, \mathbf{y})]$, where $\mathbf{x}$ is a given input, $\mathbf{y}$ is an output generated by the model, and $U(\mathbf{x}, \mathbf{y})$ is a utility function that assigns a high score to desirable outputs and a low score to undesirable ones. What is the direct consequence of minimizing this loss function on the model's behavior?

Learn Before

Basic A2C Formulation for LLMs

Multiple Choice

A language model's policy, $\pi_{\theta}$, is being updated by minimizing the loss function $\mathcal{L}(\theta) = -\mathbb{E}[U(\mathbf{x}, \mathbf{y})]$, where $\mathbf{x}$ is a given input, $\mathbf{y}$ is an output generated by the model, and $U(\mathbf{x}, \mathbf{y})$ is a utility function that assigns a high score to desirable outputs and a low score to undesirable ones. What is the direct consequence of minimizing this loss function on the model's behavior?
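
To make the setup concrete, here is a minimal sketch of this loss on a toy problem. It is an illustration under stated assumptions, not the formulation used by any particular system: the policy is assumed to be a softmax over three candidate outputs for a single fixed input, the utility values are made up, and the function names (`softmax`, `loss_gradient`, `train`) are hypothetical. Because the expectation can be computed exactly for three outputs, the gradient of $\mathcal{L}(\theta) = -\mathbb{E}[U(\mathbf{x}, \mathbf{y})]$ with respect to each logit is $-\pi_k (U_k - \mathbb{E}[U])$.

```python
import math


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_utility(logits, utilities):
    """E[U] under the softmax policy defined by the logits."""
    probs = softmax(logits)
    return sum(p * u for p, u in zip(probs, utilities))


def loss_gradient(logits, utilities):
    """Exact gradient of L = -E[U] with respect to each logit.

    For a softmax policy, dL/dz_k = -p_k * (U_k - E[U]).
    """
    probs = softmax(logits)
    eu = sum(p * u for p, u in zip(probs, utilities))
    return [-p * (u - eu) for p, u in zip(probs, utilities)]


def train(utilities, steps=200, lr=0.5):
    """Plain gradient descent on L, starting from a uniform policy."""
    logits = [0.0] * len(utilities)
    for _ in range(steps):
        grad = loss_gradient(logits, utilities)
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits


# Hypothetical utilities: output 0 is desirable, output 2 is undesirable.
utilities = [1.0, 0.0, -1.0]
initial_probs = softmax([0.0, 0.0, 0.0])
final_logits = train(utilities)
final_probs = softmax(final_logits)

print("initial probabilities:", initial_probs)
print("final probabilities:  ", final_probs)
print("expected utility:     ", expected_utility(final_logits, utilities))
```

In this toy setting, descending the loss shifts probability mass toward the output with the highest utility and away from the output with the lowest utility. A real language model samples $\mathbf{y}$ from a very large output space, so the expectation is estimated from samples rather than summed exactly.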

Updated 2025-09-28
