Multiple Choice

In policy gradient methods, the gradient of the performance objective is estimated as an expectation over trajectories. Each trajectory's contribution to this estimate is the product of its cumulative reward and the gradient of its log-probability. Given this structure, why can these methods effectively handle tasks with non-differentiable reward functions, such as a simple binary reward for winning or losing a game?
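The structure described above can be made concrete with a minimal sketch. The example below (an assumed setup, not from the source) trains a softmax policy on a two-armed bandit with a binary win/lose reward using the REINFORCE estimator, grad J(theta) = E[R(tau) * grad log pi_theta(tau)]. The key point: the reward function is a non-differentiable coin flip, but the update only ever differentiates the log-probability of the sampled action.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits of a softmax policy over 2 actions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(action):
    # Binary, non-differentiable reward: "win" (1.0) or "lose" (0.0).
    # Arm 1 wins 80% of the time, arm 0 only 20% (assumed probabilities).
    p_win = 0.8 if action == 1 else 0.2
    return 1.0 if rng.random() < p_win else 0.0

lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)      # sample a trajectory (here: one action)
    r = reward(a)                    # observe non-differentiable reward
    # Gradient of log pi(a) w.r.t. softmax logits: one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi    # REINFORCE update: r scales the score

print(softmax(theta))  # policy shifts toward the higher-win-rate arm
```

No gradient of `reward` is ever taken; the reward only appears as a scalar weight on the score function, which is why a binary game outcome is enough to drive learning.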

Updated 2025-09-28

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science