Short Answer

Addressing Data Mismatch in Policy Gradient Training

A reinforcement learning agent is trained using a policy gradient method. To be more data-efficient, it reuses experiences collected under a previous version of its policy. Explain the statistical challenge this creates when trying to evaluate the agent's current policy, and describe the conceptual purpose of the technique used to mitigate this challenge.

0

1

Updated 2025-10-06

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science