Concept

Reusing Transformer Training for Reward Models

Because the reward model used in alignment is itself a large language model (LLM), its training can directly reuse standard Transformer training procedures. The primary modification is replacing the cross-entropy loss used in standard LLM pre-training with a pairwise comparison loss, so the model learns from human preference data instead of next-token targets.
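As a rough illustration of this idea, the sketch below (in PyTorch, with a toy embedding standing in for the pretrained Transformer backbone, and hypothetical names throughout) shows the two pieces the paragraph describes: a reused backbone topped with a scalar value head instead of a vocabulary head, and a Bradley-Terry-style pairwise loss on (chosen, rejected) response pairs.

```python
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    """Reward model sketch: reused backbone + scalar value head."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        # Stand-in for a pretrained Transformer; in practice the full
        # LLM's weights would be reused here.
        self.backbone = torch.nn.Embedding(vocab_size, hidden_size)
        # Scalar head replaces the usual vocabulary-sized output layer.
        self.value_head = torch.nn.Linear(hidden_size, 1)

    def forward(self, token_ids):
        hidden = self.backbone(token_ids)        # (batch, seq, hidden)
        last_hidden = hidden[:, -1, :]           # final token's state
        return self.value_head(last_hidden).squeeze(-1)  # one reward per sequence

def pairwise_loss(r_chosen, r_rejected):
    # Bradley-Terry style objective: push the preferred response's
    # reward above the rejected one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
chosen = torch.randint(0, 100, (4, 8))    # token ids of preferred responses
rejected = torch.randint(0, 100, (4, 8))  # token ids of rejected responses
loss = pairwise_loss(model(chosen), model(rejected))
loss.backward()  # the standard Transformer training loop applies unchanged
```

Everything else about the optimization (batching, backpropagation, the optimizer step) is unchanged from ordinary Transformer training; only the output head and the loss differ.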

Updated 2026-05-01

Tags

Foundations of Large Language Models

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences