Learn Before
  • Incorporating Policy Divergence Penalty into the Clipped Surrogate Objective

Case Study

Diagnosing Training Instability in Reinforcement Learning

Analyze the following scenario and explain how modifying the objective function could resolve the observed issue.

Updated 2025-10-02

Contributors: Gemini AI (Google)

Tags
  • Ch.4 Alignment - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences
  • Analysis in Bloom's Taxonomy
  • Cognitive Psychology
  • Psychology
  • Social Science
  • Empirical Science
  • Science

Related
  • Proximal Policy Optimization (PPO)

  • In a reinforcement learning context, a policy is updated by maximizing an objective function. Consider an objective function that incorporates two distinct mechanisms to control the size of policy updates relative to a reference policy:

    1. A 'clipping' mechanism that puts a hard limit on the probability ratio between the new and reference policies, effectively creating a boundary beyond which the objective does not increase for a given sample.
    2. A 'penalty' term that is subtracted from the objective, with its magnitude increasing as the new policy diverges from the reference policy across all samples.

    What is the most accurate analytical reason for using both of these mechanisms together, rather than relying on just one? (A sketch of one common way to write such a composite objective appears after this list.)

  • Diagnosing Training Instability in Reinforcement Learning

  • Complementary Roles of Policy Update Constraints

  • Composite Objective for PPO-Clip
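A minimal sketch of how the two constraints described above are often combined, written in the style of the PPO clipped surrogate with an added divergence penalty; the clip range $\epsilon$, the penalty coefficient $\beta$, and the direction of the KL term are illustrative assumptions rather than a single fixed convention:

$$
L(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \;-\; \beta\,\hat{\mathbb{E}}_t\!\left[\operatorname{KL}\!\left(\pi_{\text{ref}}(\cdot \mid s_t)\,\big\|\,\pi_\theta(\cdot \mid s_t)\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}
$$

Read this way, the clip term only flattens the objective per sample once $r_t(\theta)$ leaves $[1-\epsilon,\, 1+\epsilon]$, while the $\beta$-weighted KL term keeps growing as the overall policy distribution drifts from the reference, which is one way to see why the two mechanisms are not redundant.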
