Basic A2C Formulation for LLMs
In Reinforcement Learning from Human Feedback (RLHF), we typically lack a human-annotated input-output dataset and instead rely on an input-only dataset, denoted as D = {x_1, ..., x_M}. In this scenario, outputs are generated by sampling from the language model itself. The fundamental Advantage Actor-Critic (A2C) loss function is defined as

Loss(θ) = - Σ_{x ∈ D} E_{y ~ π_θ(·|x)} [U(x, y)]

Here, y ~ π_θ(·|x) indicates that the output sequence y is sampled according to the policy π_θ, and U(x, y) is the utility function, a sum of A(x, y_<t, y_t) * log π_θ(y_t | x, y_<t) terms over the sequence. While this formulation serves as a basis, more sophisticated reinforcement learning algorithms are typically employed in practice.
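A minimal sketch of this loss, assuming per-token log-probabilities and advantage estimates have already been computed; the tensor names and toy numbers below are illustrative, not taken from the book:

```python
import torch

def a2c_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative expected utility, with U(x, y) = sum_t A(x, y_<t, y_t) * log pi(y_t | x, y_<t)."""
    # Advantages are treated as fixed targets; only the policy's log-probabilities carry gradient.
    utility = (advantages.detach() * logprobs).sum(dim=-1)   # sum over the sequence
    return -utility.mean()                                    # average over sampled outputs

# Toy usage with made-up numbers for a single two-token output:
logprobs = torch.tensor([[-2.0, -0.5]], requires_grad=True)  # log pi_theta(y_t | x, y_<t)
advantages = torch.tensor([[1.0, -1.5]])                      # A(x, y_<t, y_t)
loss = a2c_loss(logprobs, advantages)
loss.backward()
print(loss.item())     # 1.25
print(logprobs.grad)   # tensor([[-1.0, 1.5]]): descent raises logprobs with positive advantage,
                       # and lowers those with negative advantage
```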
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Basic A2C Formulation for LLMs
Prevalence of Advanced RL Algorithms in RLHF
During the fine-tuning of a large language model using an Advantage Actor-Critic (A2C) method, the model generates a response to a given prompt. This response is then evaluated to guide the model's learning process. Which of the following statements best describes the distinct roles of the 'actor' and the 'critic' in a single update step?
You are fine-tuning a large language model using a reinforcement learning process that involves both a policy (the language model itself) and a value function (a 'critic'). For a single training instance based on one input prompt, arrange the following events in the correct chronological order.
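For the ordering question above, a schematic sketch of one such training instance in chronological order; the callables and names are placeholders, not the book's or any library's API:

```python
def a2c_single_step(prompt, sample_response, reward_model, critic, update_policy, update_critic):
    """One A2C-style training instance for a single prompt, in chronological order."""
    response = sample_response(prompt)            # 1. actor (policy) samples y ~ pi_theta(.|x)
    reward = reward_model(prompt, response)       # 2. reward model scores the (prompt, response) pair
    baseline = critic(prompt)                     # 3. critic provides the value estimate V(x)
    advantage = reward - baseline                 # 4. advantage A = r - V(x)
    update_policy(prompt, response, advantage)    # 5. advantage-weighted policy-gradient step
    update_critic(prompt, reward)                 # 6. critic regresses toward the observed reward
    return advantage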
Diagnosing Training Instability in LLM Alignment
During a fine-tuning step for a large language model using an Advantage Actor-Critic (A2C) approach, the model generates a response to a prompt. The reward for this response, as determined by a separate reward model, is significantly higher than the critic's baseline value estimate for that prompt. What is the most likely immediate consequence for the language model's parameters during the subsequent policy update?
A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:
- For token 'innovative': log-probability log π(y_t|...): -3.0, advantage A(...): +4.0
- For token 'effective': log-probability log π(y_t|...): -1.2, advantage A(...): +2.0
Based on the utility function U used in policy gradient methods, which is a sum of log π * A terms over the sequence, which token's selection results in a larger (i.e., less negative) contribution to the total utility U for the entire sequence at this specific step t?
Analyzing Policy Gradient Updates for Text Generation
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.
Basic A2C Formulation for LLMs
Learn After
A language model's policy, π_θ, is being updated by minimizing the loss function Loss(θ) = -E_{y ~ π_θ(·|x)}[U(x, y)], where x is a given input, y is an output generated by the model, and U is a utility function that assigns a high score to desirable outputs and a low score to undesirable ones. What is the direct consequence of minimizing this loss function on the model's behavior?
Deconstructing the Reinforcement Learning Loss Function
A machine learning engineer is fine-tuning a large language model using a reinforcement learning approach. They mistakenly define the loss function to be minimized as Loss(θ) = E_{y ~ π_θ(·|x)}[U(x, y)], without the leading negative sign, where U is a utility function that returns high values for desirable outputs and low values for undesirable ones. What is the most likely outcome of this training process?
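A tiny numerical check of what the sign of the loss does to the update direction, using the score-function surrogate U * log π for the expectation; the numbers are illustrative and assume nothing beyond basic autograd:

```python
import torch

logprob = torch.tensor([-1.5], requires_grad=True)  # log pi_theta(y|x) of a sampled, desirable output
utility = torch.tensor([2.0])                        # U(x, y) is high because y is desirable

correct_loss = -(utility * logprob).sum()            # minimize -E[U]
correct_loss.backward()
print(logprob.grad)                                  # tensor([-2.]): a descent step raises log pi(y|x)

logprob.grad = None
mistaken_loss = (utility * logprob).sum()            # minimize +E[U] (the sign mistake)
mistaken_loss.backward()
print(logprob.grad)                                  # tensor([2.]): a descent step lowers log pi(y|x),
                                                     # pushing the model toward low-utility outputs
```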
Prevalence of Advanced RL Algorithms in RLHF