Comparison of DPO and PPO Sample Efficiency

Direct Preference Optimization (DPO) is generally considered more sample-efficient than Proximal Policy Optimization (PPO). The efficiency comes from DPO's ability to learn directly from a static, fixed dataset of preference pairs, optimizing a simple offline classification-style objective. PPO, in contrast, is an online algorithm: it must repeatedly generate fresh samples from the current policy during training, a computationally expensive process.
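To make the offline objective concrete, here is a minimal sketch of the DPO loss in PyTorch. The function name, argument names, and the default `beta` value are illustrative assumptions; the sketch presumes that per-sequence log-probabilities under the trainable policy and the frozen reference model have already been computed for each preference pair.

```python
# Minimal sketch of the DPO loss (illustrative names, not a reference implementation).
# Assumes per-sequence log-probs are precomputed for each (chosen, rejected) pair.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)), averaged."""
    # log pi_theta(y_w|x) - log pi_ref(y_w|x) for the preferred response
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    # log pi_theta(y_l|x) - log pi_ref(y_l|x) for the dispreferred response
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Because every quantity in this loss comes from the fixed preference dataset and a frozen reference model, no sampling from the policy is needed during optimization, which is the source of DPO's sample-efficiency advantage over PPO's online rollout loop.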
