Learn Before
Multiple Choice

A machine learning engineer is fine-tuning a large language model using a reinforcement learning approach. They mistakenly define the loss function to be minimized as $\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\, \mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\cdot|\mathbf{x})} [U(\mathbf{x}, \mathbf{y})]$, where $U(\mathbf{x}, \mathbf{y})$ is a utility function that returns high values for desirable outputs and low values for undesirable ones. What is the most likely outcome of this training process?

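The sign error can be seen directly in a toy simulation. The sketch below (an illustrative assumption, not the question's setting: a two-output softmax policy standing in for an LLM, with a hypothetical utility of 1.0 for the desirable output and 0.0 for the undesirable one) runs gradient *descent* on $\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{y}\sim\pi_\theta}[U(\mathbf{y})]$ exactly as the engineer specified, and watches where the probability mass goes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical setup: output 0 is "desirable" (utility 1.0),
# output 1 is "undesirable" (utility 0.0).
utilities = [1.0, 0.0]
theta = [0.0, 0.0]  # policy logits, start with a uniform policy
lr = 0.5

p_desirable_start = softmax(theta)[0]

for _ in range(200):
    probs = softmax(theta)
    expected_u = sum(p * u for p, u in zip(probs, utilities))
    # Exact gradient of L(theta) = sum_y pi_theta(y) U(y) w.r.t. the logits:
    #   dL/dtheta_i = pi_i * (U_i - E[U])
    grads = [p * (u - expected_u) for p, u in zip(probs, utilities)]
    # Gradient DESCENT on L, as mistakenly specified: this *reduces*
    # expected utility instead of increasing it.
    theta = [t - lr * g for t, g in zip(theta, grads)]

p_desirable_end = softmax(theta)[0]
print(p_desirable_start, p_desirable_end)
```

Under these assumptions, the probability of the desirable output collapses from 0.5 toward 0: minimizing expected utility trains the policy to prefer undesirable outputs, which is the behavior the question asks about (the correct objective would minimize the *negative* expected utility).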

Updated 2025-10-06


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science