Multiple Choice

In a policy gradient algorithm, a common way to stabilize learning is to subtract a baseline from the total reward of each trajectory before computing the update. A well-chosen baseline reduces the variance of the gradient estimate without changing its expected direction. Which of the following baselines, if subtracted from the total reward, would bias the estimate and could, on average, push the policy updates in the wrong direction?
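
For context on why the baseline matters: subtracting any action-independent value b leaves the expected update unchanged, because E[b ∇θ log πθ(a)] = b Σ_a πθ(a) ∇θ log πθ(a) = b ∇θ Σ_a πθ(a) = b ∇θ 1 = 0. A baseline that varies with the sampled action breaks this identity. The sketch below is a minimal illustration of that point, assuming a toy 3-armed bandit with a softmax policy; the setup, variable names, and the particular action-dependent baseline are illustrative choices, not part of the original question.

```python
# Minimal sketch (assumed setup: a 3-armed bandit with a softmax policy;
# all names here are illustrative, not taken from the question).
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.1, 0.4])    # policy logits
rewards = np.array([1.0, 2.0, 3.0])   # deterministic reward for each action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)
eye = np.eye(3)

# Exact gradient of E[R]: for a softmax policy, grad log pi(a) = e_a - pi,
# so the true gradient is sum_a pi(a) * R(a) * (e_a - pi).
true_grad = sum(pi[a] * rewards[a] * (eye[a] - pi) for a in range(3))

def mc_grad(baseline, n=200_000):
    """Monte Carlo REINFORCE gradient using (R - b) * grad log pi(a)."""
    a = rng.choice(3, p=pi, size=n)        # sample actions from the policy
    advantage = rewards[a] - baseline(a)   # subtract the chosen baseline
    return (advantage[:, None] * (eye[a] - pi)).mean(axis=0)

# A constant (action-independent) baseline leaves the estimate unbiased.
constant = mc_grad(lambda a: pi @ rewards)
# A baseline that depends on the sampled action biases the estimate:
# this choice shrinks the expected gradient to half its true value.
action_dep = mc_grad(lambda a: 0.5 * rewards[a])

print("true gradient      :", true_grad)
print("constant baseline  :", constant)     # close to true_grad
print("action-dep baseline:", action_dep)   # about 0.5 * true_grad
```

With the constant baseline, the Monte Carlo estimate matches the exact gradient up to sampling noise; with the action-dependent baseline it converges to roughly half the true gradient, a systematic bias of the kind the question describes.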


