Multiple Choice

An agent is being trained with a policy gradient method. The gradient of the objective J(θ) is expressed as an expectation over trajectories τ sampled from the policy π_θ:

∇J(θ) = E_{τ~π_θ}[ (∇_θ log Pr_θ(τ)) R(τ) ]

In practice, this is estimated from a batch D of |D| sampled trajectories using the sample mean:

∇J(θ) ≈ (1/|D|) Σ_{τ∈D} (∇_θ log Pr_θ(τ)) R(τ)

What key assumption allows the transition from the theoretical expectation to this practical sample-mean estimator?
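To make the estimator concrete, here is a minimal Python sketch (NumPy only). Everything in it is an illustrative assumption rather than part of the question: a stateless softmax policy over three actions and a toy reward that counts how often action 0 is chosen. The sketch draws a batch D of trajectories from the current policy and averages (∇_θ log Pr_θ(τ)) R(τ) over the batch, exactly as in the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a stateless softmax policy over 3 actions with logits theta.
# Pr_theta(tau) factorizes over time steps, so
# grad_theta log Pr_theta(tau) = sum_t grad_theta log pi_theta(a_t).
theta = np.zeros(3)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # For a categorical softmax policy: grad log pi(a) = onehot(a) - pi.
    g = -softmax(theta)
    g[a] += 1.0
    return g

def sample_trajectory(theta, horizon=5):
    # Hypothetical environment: reward 1 each time action 0 is taken.
    actions = rng.choice(3, size=horizon, p=softmax(theta))
    return actions, float(np.sum(actions == 0))

# Draw a batch D from the *current* policy and form the sample mean
#   (1/|D|) sum_{tau in D} (grad log Pr_theta(tau)) R(tau).
D = [sample_trajectory(theta) for _ in range(1000)]
grad_estimate = np.mean(
    [sum(grad_log_pi(theta, a) for a in actions) * R for actions, R in D],
    axis=0,
)
print(grad_estimate)  # pushes the logit of action 0 upward
```

Note the requirement baked into the batch-sampling line: the trajectories in D must be drawn from the current policy π_θ for the sample mean to approximate the expectation.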

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Analysis in Bloom's Taxonomy