Concept

Advantage of Policy Gradients: Non-Differentiable Reward Functions

A significant advantage of the policy gradient method is that the cumulative reward function, R(τ), is not required to be differentiable. The gradient of the objective is computed with respect to the logarithm of the policy's probability, ∇θ J(θ) = E_τ[ R(τ) ∇θ log π_θ(τ) ], so R(τ) enters only as a scalar weight on the score function ∇θ log π_θ(τ) and is never differentiated itself. This property allows reinforcement learning to use any type of reward function, including ones that are discontinuous, non-differentiable, or arbitrarily complex (e.g., black-box rewards).
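This can be illustrated with a minimal REINFORCE sketch (the bandit setup, step-function reward, and hyperparameters below are illustrative assumptions, not from the original text). The reward is a discontinuous step function of the action, yet learning still works because only log π_θ is differentiated; the reward appears purely as a multiplicative weight on the score function.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(action):
    # Discontinuous, non-differentiable reward: a step function of the
    # action index. It is only ever evaluated, never differentiated.
    return 1.0 if action == 2 else 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros(3)  # logits of a softmax policy pi_theta(a)
lr = 0.5

for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    R = reward(a)  # R(tau) as a plain scalar

    # Analytic gradient of log pi_theta(a) for a softmax policy:
    # d/d theta_k log pi(a) = 1{k == a} - pi(k)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    # REINFORCE update: the (possibly non-differentiable) reward
    # simply scales the score-function gradient.
    theta += lr * R * grad_log_pi

probs = softmax(theta)
print(np.argmax(probs))  # the policy concentrates on the rewarded action
```

Despite the reward having zero gradient almost everywhere, the policy's probability mass shifts onto the rewarded action, since the update direction comes entirely from ∇θ log π_θ.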

Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences