1Cademy - Calculating Approximated Policy Divergence

Learn Before

Approximated Policy Divergence Penalty Formula

Case Study

Calculating Approximated Policy Divergence

An agent follows a 3-step trajectory. The log-probabilities of the actions taken under the current policy (π_θ) and a reference policy (π_θ_ref) are recorded at each step. Based on the data in the table below, calculate the approximated policy divergence penalty for this trajectory.
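
The calculation can be sketched in a few lines of Python. This sketch assumes the approximated penalty is the sum, over the steps of the trajectory, of the difference between the log-probability under the current policy and the log-probability under the reference policy, with any scaling coefficient left out. The table's values are not reproduced on this page, so the numbers below are hypothetical placeholders, and the function name approx_policy_divergence is chosen only for illustration.

```python
def approx_policy_divergence(logp_current, logp_reference):
    """Sum of per-step log-probability differences along one trajectory.

    Assumed form: sum_t [ log pi_theta(a_t | s_t) - log pi_theta_ref(a_t | s_t) ].
    """
    if len(logp_current) != len(logp_reference):
        raise ValueError("both policies need one log-probability per step")
    return sum(cur - ref for cur, ref in zip(logp_current, logp_reference))


# Hypothetical placeholder values for a 3-step trajectory (NOT the table's data).
logp_current = [-0.5, -1.0, -0.25]     # log pi_theta(a_t | s_t)
logp_reference = [-0.75, -1.5, -0.5]   # log pi_theta_ref(a_t | s_t)

penalty = approx_policy_divergence(logp_current, logp_reference)
print(penalty)  # 0.25 + 0.5 + 0.25 = 1.0
```

With these placeholder numbers the current policy assigns a higher probability than the reference policy to every action taken, so each per-step difference is positive and the penalty is positive. Substituting the values from the table gives the answer to the case study.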

Updated 2025-10-03

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences