Multiple Choice

An agent is being trained using a policy gradient method. A batch of data D is collected, containing exactly two trajectories, τ_1 and τ_2.

  • Trajectory τ_1 has a total reward R(τ_1) = 10.
  • Trajectory τ_2 has a total reward R(τ_2) = -5.

The gradient of the log-probability for each trajectory with respect to the policy parameters θ is denoted as ∇_θ log Pr_θ(τ_1) and ∇_θ log Pr_θ(τ_2), respectively.

Based on the standard practical estimator for the policy gradient, which of the following expressions correctly represents the estimated gradient ∇J(θ) for this batch?

0

1

Updated 2025-10-08

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science