Concept

A2C Loss Function Formulation

In the Advantage Actor-Critic (A2C) algorithm, the loss function is constructed from the policy gradient objective that uses the advantage function. This objective, often expressed as the utility function $U(\tau; \theta) = \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t)$, forms the core of the actor's loss: the loss is the negative of this utility, so minimizing the loss during training maximizes the utility. By maximizing the utility over sampled trajectories $\tau$, the model adjusts its policy to favor actions with higher advantages.
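This relationship can be illustrated with a minimal sketch. The function below is a hypothetical helper (not from any particular library) that assumes per-step log-probabilities $\log \pi_{\theta}(a_t \mid s_t)$ and precomputed advantages $A(s_t, a_t)$ are already available as arrays; it returns the negated utility as the actor loss.

```python
import numpy as np

def a2c_actor_loss(log_probs, advantages):
    """Actor loss for A2C: the negative of U(tau; theta).

    log_probs:  array of log pi_theta(a_t | s_t) over one trajectory.
    advantages: array of A(s_t, a_t), treated as constants
                (no gradient flows through them in practice).
    """
    utility = np.sum(np.asarray(log_probs) * np.asarray(advantages))
    return -utility  # minimizing this loss maximizes the utility

# Example with a two-step trajectory (illustrative values only):
log_probs = np.log([0.5, 0.25])
advantages = np.array([1.0, -0.5])
loss = a2c_actor_loss(log_probs, advantages)
```

In an autodiff framework, gradient descent on this loss pushes up the log-probability of actions with positive advantage and pushes it down for actions with negative advantage.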

Updated 2026-05-02

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences