1Cademy - Dual Learning Tasks of RLHF: Reward and Policy Learning

Dual Learning Tasks of RLHF: Reward and Policy Learning

Reinforcement Learning from Human Feedback (RLHF) consists of two distinct learning stages. The first stage is reward model learning, in which a model is trained on human feedback to score the agent's outputs. The second stage is policy learning, in which the agent's policy is optimized with a reinforcement learning algorithm, using the trained reward model's scores as the reward signal.
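
The two stages can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the text above: the agent chooses among three fixed outputs, human feedback arrives as pairwise preferences, the reward model is fit with a Bradley-Terry likelihood, and the policy is a softmax trained with REINFORCE. The function names train_reward_model and train_policy are hypothetical.

```python
import math
import random


def train_reward_model(preferences, n_outputs, lr=0.1, epochs=200):
    """Stage 1 (assumed form): fit one scalar reward per output.

    Each preference is a (preferred, rejected) pair of output indices.
    Bradley-Terry model: P(preferred wins) = sigmoid(r[preferred] - r[rejected]).
    """
    rewards = [0.0] * n_outputs
    for _ in range(epochs):
        for win, lose in preferences:
            p = 1.0 / (1.0 + math.exp(-(rewards[win] - rewards[lose])))
            # Gradient ascent on log p: d/d r[win] = 1 - p, d/d r[lose] = -(1 - p).
            rewards[win] += lr * (1.0 - p)
            rewards[lose] -= lr * (1.0 - p)
    return rewards


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, lr=0.1, steps=2000, seed=0):
    """Stage 2 (assumed form): REINFORCE on a softmax policy.

    The learned rewards from stage 1 are the only reward signal used here.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Expected reward under the current policy, used as a baseline.
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[action] - baseline
        for a in range(len(logits)):
            # Gradient of log pi(action) with respect to logit a.
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad
    return softmax(logits)


# Hypothetical feedback: output 2 is preferred over 1, and 1 over 0.
preferences = [(2, 0), (2, 1), (1, 0)] * 5
rewards = train_reward_model(preferences, n_outputs=3)
policy = train_policy(rewards)
print("learned rewards:", rewards)
print("policy probabilities:", policy)
```

In this sketch the policy never sees the human preferences directly. It only sees the scores produced by the reward model, which mirrors the separation between the two stages. Practical RLHF systems replace the per-output reward table with a learned neural scorer and typically use a more elaborate policy optimizer.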

Updated 2026-04-20
