Formula

Objective Function as Expected Cumulative Reward (Performance Function)

In reinforcement learning, the objective function $J(\theta)$, also known as the performance function, evaluates the effectiveness of a policy $\pi_\theta$ parameterized by $\theta$. It is defined as the expected cumulative reward over all possible trajectories $\tau$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$$

The notation $\tau \sim \pi_{\theta}$ signifies that the trajectory $\tau$ is generated by following the policy $\pi_{\theta}$. Alternatively, this objective can be written as a sum over the space of all trajectories $\mathcal{D}$, weighted by the probability of each trajectory under the policy:

$$J(\theta) = \sum_{\tau \in \mathcal{D}} \mathrm{Pr}_{\theta}(\tau)\, R(\tau)$$

Here, $R(\tau) = \sum_{t=1}^{T} r_t$ is the cumulative reward of a trajectory, where $r_t$ is the reward received at step $t$.
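The two forms of the objective can be checked against each other numerically. The sketch below uses a hypothetical one-step environment (not from the text) with two actions, so each "trajectory" is a single action and its reward: the exact sum over trajectories is compared with a Monte Carlo estimate of the expectation under a softmax policy $\pi_\theta$.

```python
import math
import random

# Hypothetical toy setup: a single-step MDP with two actions,
# so each trajectory tau is one action, and R(tau) is its reward.
rewards = [1.0, 0.0]   # R(tau) for each of the two trajectories
theta = [2.0, 0.5]     # policy parameters

def policy_probs(theta):
    """Softmax policy pi_theta: Pr_theta(tau) for each trajectory."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [x / s for x in z]

def exact_J(theta):
    """Sum form: J(theta) = sum_tau Pr_theta(tau) * R(tau)."""
    probs = policy_probs(theta)
    return sum(p * r for p, r in zip(probs, rewards))

def monte_carlo_J(theta, n=100_000, seed=0):
    """Expectation form: J(theta) = E_{tau ~ pi_theta}[R(tau)],
    estimated by sampling trajectories from the policy."""
    rng = random.Random(seed)
    probs = policy_probs(theta)
    total = 0.0
    for _ in range(n):
        a = 0 if rng.random() < probs[0] else 1  # sample tau ~ pi_theta
        total += rewards[a]                      # accumulate R(tau)
    return total / n

print(exact_J(theta))        # ~0.8176
print(monte_carlo_J(theta))  # close to the exact value
```

With many samples the Monte Carlo average converges to the exact weighted sum, illustrating that the two formulas define the same quantity.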


Updated 2026-05-02


Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
