Case Study

Diagnosing Policy Update Instability

An engineer is training a reinforcement learning agent using the policy update rule shown below, where π_θ(a_t|s_t) is the agent's policy and A(s_t, a_t) is the advantage function. Based on the components of this formula, what characteristic of the policy π_θ itself is the most likely cause of the observed instability? Explain your reasoning.

Updated 2025-10-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science