Multiple Choice

In a reinforcement learning scenario, the performance of a new policy, defined by parameters θ, is often estimated using an objective function that relies on data collected from a reference policy, defined by parameters θ_ref. This objective function is given by

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\text{Pr}_{\theta}(\tau)}{\text{Pr}_{\theta_{\text{ref}}}(\tau)} R(\tau) \right]$$

where τ denotes a trajectory, Pr(τ) is the probability of that trajectory under the given policy, and R(τ) is its total reward. Which of the following statements most accurately describes the relationship between this objective function, $J(\theta)$, and the true expected reward of the reference policy, $\mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}}[R(\tau)]$?
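The key property of this importance-sampled objective can be checked numerically. A minimal sketch, using a hypothetical discrete space of three "trajectories" with made-up probabilities and rewards, shows that $J(\theta)$ equals the expected reward under the new policy $\pi_\theta$, not under the reference policy:

```python
# Hypothetical toy setup: three possible "trajectories" with fixed rewards.
# All numbers below are illustrative assumptions, not from the question.
trajectories = [0, 1, 2]
rewards = {0: 1.0, 1: 5.0, 2: 10.0}

# Trajectory probabilities under the new policy (theta) and the
# reference policy (theta_ref); each sums to 1.
p_theta = {0: 0.1, 1: 0.3, 2: 0.6}
p_ref   = {0: 0.5, 1: 0.3, 2: 0.2}

# Exact expected rewards under each policy.
exp_reward_theta = sum(p_theta[t] * rewards[t] for t in trajectories)
exp_reward_ref   = sum(p_ref[t]   * rewards[t] for t in trajectories)

# J(theta): the importance-weighted expectation, taken under the
# reference policy's trajectory distribution.
J = sum(p_ref[t] * (p_theta[t] / p_ref[t]) * rewards[t] for t in trajectories)

print(J, exp_reward_theta, exp_reward_ref)
```

Since the ratio $\text{Pr}_\theta(\tau)/\text{Pr}_{\theta_{\text{ref}}}(\tau)$ cancels the reference probability inside the expectation, `J` matches `exp_reward_theta` exactly, while `exp_reward_ref` differs whenever the two policies assign different probabilities to rewarding trajectories.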


Updated 2025-10-08


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
