Case Study

Post-Deployment Alignment Update: Choosing Between DPO and RLHF Under Logging and Compute Constraints

You are the alignment lead for an enterprise LLM used in a regulated customer-support product. You must ship an alignment update in 10 days. Due to policy, you are not allowed to run an online training loop that repeatedly samples new model outputs during training; you may only train on a fixed dataset that has already been collected. The dataset contains tuples (x, y_chosen, y_rejected) from human reviewers. You also have a frozen reference model π_ref (the last production model). For each tuple, your logging system can provide the log-probabilities log π_ref(y|x) for both y_chosen and y_rejected, and during training you can compute log π_θ(y|x) for the current model π_θ.
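To make the logging constraint concrete, here is a minimal sketch (assuming PyTorch; the function name `dpo_loss` and the tensor names are illustrative, not part of the scenario) of a preference loss computed from exactly the four logged quantities per pair, with no new responses sampled during training:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_chosen: torch.Tensor,
             logp_theta_rejected: torch.Tensor,
             logp_ref_chosen: torch.Tensor,
             logp_ref_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss for a batch of logged preference pairs.

    Each argument is a 1-D tensor of shape (batch,) holding log pi(y|x)
    summed over the tokens of y. The pi_ref values come from the logging
    system; only the pi_theta values need a forward pass during training.
    """
    # Log-ratios of the trainable policy against the frozen reference model
    chosen_ratio = logp_theta_chosen - logp_ref_chosen
    rejected_ratio = logp_theta_rejected - logp_ref_rejected

    # Bradley-Terry logit; the prompt-dependent partition term cancels here
    logits = beta * (chosen_ratio - rejected_ratio)

    # Maximize sigma(logits), i.e. the probability the chosen response wins
    return -F.logsigmoid(logits).mean()


# Example with dummy logged values for a batch of two pairs
loss = dpo_loss(
    logp_theta_chosen=torch.tensor([-12.3, -8.1]),
    logp_theta_rejected=torch.tensor([-14.0, -9.5]),
    logp_ref_chosen=torch.tensor([-12.0, -8.4]),
    logp_ref_rejected=torch.tensor([-13.2, -9.1]),
)
```

Only the π_θ terms require a forward pass; the π_ref terms are read straight from the logs, which is why a single fixed, pre-collected dataset suffices.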

A senior engineer proposes: “Let’s follow the classic RLHF pipeline anyway: train a reward model on the preference pairs, then run PPO with that reward model. If we can’t do online sampling, we’ll just reuse the same (x, y_chosen, y_rejected) pairs as ‘trajectories’ in PPO.”

As the decision-maker, analyze this proposal and recommend a training approach. In your answer, you must (1) explain why the DPO preference probability can be computed without an explicit reward model, using policy ratios against π_ref (including what cancels out conceptually), and (2) use that fact to justify why DPO, as an offline RL method, fits the fixed-dataset constraint better than forcing PPO/RLHF into an offline setting with the same data. Conclude with one concrete risk or tradeoff your recommended approach introduces compared to the alternative.
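For reference when answering point (1), the standard DPO derivation (not restated in the scenario above) inverts the KL-constrained RLHF optimum π*(y|x) ∝ π_ref(y|x) exp(r(x,y)/β) to express the reward through policy ratios; the prompt-dependent normalizer Z(x) then appears in both responses' implicit rewards and cancels in the pairwise comparison:

```latex
% Implicit reward obtained by inverting the KL-constrained RLHF optimum:
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry preference probability; the intractable \beta \log Z(x)
% is identical for y_w (chosen) and y_l (rejected), so it cancels:
p_\theta(y_w \succ y_l \mid x)
  = \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)
  = \sigma\!\Bigl(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Bigr)
```

Because this probability depends only on log-ratios of π_θ against π_ref for the two logged responses, it can be evaluated and optimized directly on the fixed dataset, with no explicit reward model and no sampling loop.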

Updated 2026-02-06

Ch.4 Alignment - Foundations of Large Language Models
