Case Study

Calculating Approximated Policy Divergence

An agent follows a 3-step trajectory. The log-probabilities of the actions taken under the current policy (π_θ) and a reference policy (π_θ_ref) are recorded at each step. Based on the data in the table below, calculate the approximated policy divergence penalty for this trajectory.

0

1

Updated 2025-10-03

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Application in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science