Multiple Choice

A reinforcement learning agent is being trained with a utility function that applies an upper-bound clip to the policy probability ratio (the probability of the action under the new policy divided by its probability under the old policy), defined as min(ratio, 1 + ε), where ε is a small positive constant. Consider two distinct actions taken during an episode:

  • Action A: Has a large positive advantage, and its probability ratio is 2.0.
  • Action B: Has a large negative advantage, and its probability ratio is 0.1.

Assuming ε = 0.2, how does this specific clipping mechanism influence the policy update derived from these two actions?
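Since no worked answer is shown, the effect can be checked numerically. The sketch below is a minimal illustration, assuming the per-action utility is min(ratio, 1 + ε) × advantage (the upper-bound-only clip defined in the question) and using placeholder advantage magnitudes of ±1.0, which are not given in the question:

```python
def clipped_utility(ratio, advantage, eps=0.2):
    """Upper-bound-only clip from the question: min(ratio, 1 + eps) * advantage."""
    return min(ratio, 1 + eps) * advantage

# Action A: positive advantage (assumed +1.0 for illustration), ratio 2.0.
# The ratio exceeds 1 + eps = 1.2, so the clip is active: the utility is
# capped at (1 + eps) * advantage and no longer depends on the ratio,
# removing any incentive to push A's probability up further.
print(clipped_utility(2.0, +1.0))  # → 1.2

# Action B: negative advantage (assumed -1.0), ratio 0.1.
# min(0.1, 1.2) = 0.1, so the clip is inactive: the utility still depends
# on the ratio, and the update keeps driving B's probability down,
# because this one-sided clip imposes no lower bound on the ratio.
print(clipped_utility(0.1, -1.0))  # → -0.1
```

In other words, with this one-sided clip the update from Action A is bounded (the gradient vanishes once the ratio passes 1 + ε), while the update from Action B is not constrained at all; a symmetric clip(ratio, 1 − ε, 1 + ε) would also bound how far B's probability can be pushed down.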


Updated 2025-09-26

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science