Multiple Choice

A reinforcement learning agent is being updated. The current policy is denoted by $\pi_{\theta}$, and a batch of trajectory data has been collected using a previous, fixed policy, $\pi_{\theta_{\text{ref}}}$. To improve the current policy using this existing data, the following objective function is optimized: $L(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{ref}}}} \left[ \frac{\Pr_{\theta}(\tau)}{\Pr_{\theta_{\text{ref}}}(\tau)} R(\tau) \right]$. Which statement best analyzes the role of this objective function in the training process?
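The objective above is an importance-sampling estimate: trajectories are drawn from the fixed reference policy, and each return is reweighted by the probability ratio so the expectation approximates performance under the current policy. A minimal sketch of this estimator, assuming per-trajectory log-probabilities are available (the function name and toy data are illustrative, not from the question):

```python
import numpy as np

def importance_sampling_objective(logp_theta, logp_ref, returns):
    """Monte Carlo estimate of L(theta) from trajectories sampled
    under the reference policy pi_{theta_ref}.

    logp_theta: per-trajectory log Pr_theta(tau), shape (N,)
    logp_ref:   per-trajectory log Pr_ref(tau),   shape (N,)
    returns:    per-trajectory return R(tau),     shape (N,)
    """
    # Importance ratio Pr_theta(tau) / Pr_ref(tau), computed in log space
    # for numerical stability.
    ratios = np.exp(logp_theta - logp_ref)
    # Reweighted mean return approximates E_{tau ~ pi_theta}[R(tau)].
    return np.mean(ratios * returns)

# Sanity check: when theta == theta_ref, every ratio is 1 and the
# estimate reduces to the ordinary mean return over the batch.
logp = np.log(np.array([0.2, 0.5, 0.3]))
R = np.array([1.0, 2.0, 3.0])
print(importance_sampling_objective(logp, logp, R))  # mean return = 2.0
```

Note that this correction is only reliable while $\pi_{\theta}$ stays close to $\pi_{\theta_{\text{ref}}}$; large ratios make the estimate high-variance, which is why methods such as PPO clip them.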


Updated 2025-10-03


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science