Concept

Basic A2C Formulation for LLMs

In Reinforcement Learning from Human Feedback (RLHF), we typically lack a human-annotated input-output dataset and instead rely on an input-only dataset, denoted as $\mathcal{D}$. In this scenario, outputs are generated by sampling from the language model itself. The fundamental Advantage Actor-Critic (A2C) loss function is defined as

$$\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \, \mathbb{E}_{\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})} \big[ U(\mathbf{x},\mathbf{y};\theta) \big]$$

Here, $\mathbf{y} \sim \pi_{\theta}(\cdot|\mathbf{x})$ indicates that the output sequence $\mathbf{y}$ is sampled according to the policy $\pi_{\theta}$, and $U(\mathbf{x},\mathbf{y};\theta)$ is the utility function. While this formulation serves as a basis, more sophisticated reinforcement learning methods are typically employed in practice.
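To make the objective concrete, below is a minimal PyTorch sketch of one common way to estimate this loss: sample outputs from the policy, score them with a reward model, subtract a value baseline, and weight the sequence log-likelihood by the resulting advantage. The function and tensor names are illustrative assumptions, not part of the original text, and the advantage-weighted log-likelihood is only one common choice for the utility $U$.

```python
# A minimal Monte Carlo sketch of the A2C-style loss
#   L(theta) = -E_{x~D} E_{y~pi_theta(.|x)} [ U(x, y; theta) ]
# with U taken to be the advantage-weighted log-likelihood of the
# sampled output (a common, but not the only, choice).
import torch
import torch.nn.functional as F

def a2c_loss(policy_logits, sampled_ids, rewards, values):
    """Estimate the A2C loss over a batch of sampled outputs.

    policy_logits: (batch, seq_len, vocab) logits of pi_theta for the sampled y
    sampled_ids:   (batch, seq_len) token ids of the outputs y ~ pi_theta(.|x)
    rewards:       (batch,) sequence-level rewards, e.g. from a reward model
    values:        (batch,) baseline estimates V(x) from a critic
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    # Log-probability of each sampled token, then of the whole sequence:
    token_logp = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    seq_logp = token_logp.sum(dim=-1)            # log pi_theta(y|x)
    # Advantage is detached: it weights the policy gradient but is not
    # itself differentiated through here.
    advantage = (rewards - values).detach()
    # Negative expected utility, estimated by the batch average:
    return -(advantage * seq_logp).mean()

# Toy usage with random tensors standing in for a real model:
batch, seq_len, vocab = 4, 8, 100
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
ids = torch.randint(vocab, (batch, seq_len))
loss = a2c_loss(logits, ids, rewards=torch.randn(batch), values=torch.randn(batch))
loss.backward()
```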
