Formula

On-Policy Objective Function (Performance Measure)

The performance of a policy $\pi_\theta$ in reinforcement learning is measured by an objective function, $J(\theta)$. This function is defined as the expected value of the cumulative reward $R(\tau)$, taken over the distribution of trajectories $\tau$ generated by following the policy. The goal of the agent is to find the policy parameters $\theta$ that maximize this value. The formula is:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]$$
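Because the expectation is taken over trajectories sampled from the policy itself, $J(\theta)$ can be approximated by averaging $R(\tau)$ over sampled trajectories. The following is a minimal sketch of that Monte Carlo estimate, not a definitive implementation. It assumes a hypothetical toy setting that is not part of the original text: a stateless softmax policy whose logits are `theta`, a fixed reward per action, and a fixed horizon. All function names here are illustrative.

```python
import math
import random


def softmax(logits):
    """Convert logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_return(theta, rewards, horizon, rng):
    """Sample one trajectory tau from pi_theta and return R(tau).

    Assumption: the policy is stateless, so each step draws an action
    from softmax(theta) and collects that action's fixed reward.
    """
    probs = softmax(theta)
    actions = range(len(theta))
    total = 0.0
    for _ in range(horizon):
        action = rng.choices(actions, weights=probs)[0]
        total += rewards[action]
    return total


def estimate_objective(theta, rewards, horizon, num_trajectories, seed=0):
    """Monte Carlo estimate of J(theta): the mean of R(tau) over samples."""
    rng = random.Random(seed)
    returns = [
        sample_return(theta, rewards, horizon, rng)
        for _ in range(num_trajectories)
    ]
    return sum(returns) / len(returns)


def exact_objective(theta, rewards, horizon):
    """Closed-form J(theta) for this toy setting, used as a sanity check."""
    probs = softmax(theta)
    return horizon * sum(p * r for p, r in zip(probs, rewards))
```

In this toy setting, shifting probability mass toward the higher-reward action raises $J(\theta)$, which is the direction a policy optimization method would try to move the parameters.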

Updated 2025-10-08

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences