Multiple Choice

A machine learning engineer is training a reward model where the goal is to align the model's predicted scores, $r(\mathbf{x}, \mathbf{y})$, with human-provided scores, $\varphi(\mathbf{x}, \mathbf{y})$. The standard approach is to maximize the objective $\mathcal{L} = -\mathbb{E}\big[(\varphi(\mathbf{x}, \mathbf{y}) - r(\mathbf{x}, \mathbf{y}))^2\big]$, i.e., to minimize the mean squared error. Suppose the engineer makes a mistake and instead configures the training process to maximize the mean squared error itself, effectively dropping the negative sign from the objective: $\mathcal{L}_{\text{mistake}} = \mathbb{E}\big[(\varphi(\mathbf{x}, \mathbf{y}) - r(\mathbf{x}, \mathbf{y}))^2\big]$. What would be the most likely effect on the model's behavior during training?
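To make the setup concrete, here is a minimal sketch of the sign error under gradient-based training. It assumes a toy linear reward model on synthetic data; the model, data, and hyperparameters are illustrative, not from the source.

```python
import torch

# Toy setup (hypothetical): a linear "reward model" fit to scalar human scores.
torch.manual_seed(0)
x = torch.randn(64, 4)                               # features of (x, y) pairs
phi = x @ torch.tensor([1.0, -0.5, 0.3, 0.8])        # stand-in human scores

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(5):
    r = model(x).squeeze(-1)                 # predicted rewards r(x, y)
    mse = torch.mean((phi - r) ** 2)         # E[(phi - r)^2]
    # Correct objective: maximize -MSE, i.e. hand the optimizer `mse` to minimize.
    # The mistake: maximizing MSE. Since optimizers *minimize* the loss,
    # minimizing -mse is exactly gradient ascent on the MSE:
    loss = -mse                              # the sign error
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: MSE = {mse.item():.3f}")
```

Because each update ascends the MSE, the printed error grows step over step: the model is pushed away from the human scores, and the parameters diverge rather than converge.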


Tags

Ch.4 Alignment - Foundations of Large Language Models

Analysis in Bloom's Taxonomy