Multiple Choice

In the derivation of a policy gradient algorithm, we aim to update a policy based on actions taken within an episode. A core principle states that an action taken at a specific time step, tt, can only influence rewards received from that point forward (rt,rt+1,...r_t, r_{t+1}, ...). Given this principle, which of the following mathematical expressions correctly identifies the reward term that should be used to scale the gradient update for the action at time step tt?

0

1

Updated 2025-10-07

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science