Multiple Choice

In policy optimization, an importance-sampled surrogate objective is often used to approximate the true on-policy objective. A key mathematical property of this surrogate is that its gradient, evaluated at the reference policy (the behavior policy that collected the data), equals the true on-policy policy gradient. What is the most significant implication of this property for the training process?
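The gradient-matching property can be checked numerically. The sketch below, in a stateless (bandit) setting with assumed per-action advantage values, compares a finite-difference gradient of the importance-sampled surrogate at the reference logits against the REINFORCE (score-function) gradient estimate on the same samples; the two agree because ∇(π_θ/π_old) at θ = θ_old reduces to ∇log π_θ. All names and the softmax parameterization are illustrative choices, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
theta_old = rng.normal(size=n_actions)   # logits of the reference (data-collecting) policy
adv = rng.normal(size=n_actions)         # assumed per-action advantage estimates

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

pi_old = softmax(theta_old)
actions = rng.choice(n_actions, size=5000, p=pi_old)  # data drawn from the reference policy

def surrogate(theta):
    """Importance-sampled surrogate: mean of (pi_theta(a) / pi_old(a)) * adv(a)."""
    pi = softmax(theta)
    return np.mean(pi[actions] / pi_old[actions] * adv[actions])

# Central finite-difference gradient of the surrogate, evaluated at theta_old.
eps = 1e-5
grad_surrogate = np.array([
    (surrogate(theta_old + eps * np.eye(n_actions)[i]) -
     surrogate(theta_old - eps * np.eye(n_actions)[i])) / (2 * eps)
    for i in range(n_actions)
])

# On-policy score-function gradient on the same samples:
# grad of log pi(a) w.r.t. softmax logits is onehot(a) - pi.
onehot = np.eye(n_actions)[actions]
grad_onpolicy = np.mean((onehot - pi_old) * adv[actions, None], axis=0)

assert np.allclose(grad_surrogate, grad_onpolicy, atol=1e-4)
```

Because the two gradients coincide at the reference policy, the first update step of surrogate-based optimization moves in the true policy-gradient direction; the objectives only diverge as the new policy moves away from the data-collecting one.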

Updated 2025-09-29

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science
