Case Study

Policy Update Analysis

An agent is being trained using an actor-critic method. The actor's objective is to adjust its policy, π, to maximize expected rewards by minimizing the following loss function: L(θ) = -E[log π(a|s) * A(s,a)], where A(s,a) is the advantage of taking action a in state s. In a particular state, the critic calculates the advantages for three possible actions as shown in the case study. Based on this single-step observation, which action's probability will the policy be most strongly encouraged to increase? Justify your answer by explaining how the components of the loss function drive this update.

0

1

Updated 2025-10-08

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science