Establishing the Initial Policy in RLHF
The starting point for Reinforcement Learning from Human Feedback (RLHF) is an initial policy: an LLM that has already undergone pre-training and instruction fine-tuning. This model is treated as a version that could already be deployed to interact with users and respond to their requests, and it forms the baseline for further alignment.
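As a minimal sketch of what establishing the initial policy can look like in practice (assuming the Hugging Face transformers library; the checkpoint name below is a placeholder, not a specific recommendation), the instruction-tuned model is simply loaded as a trainable causal LM. The frozen reference copy reflects the common, though not universally required, practice of penalizing divergence from the baseline during RL:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical instruction-tuned (SFT) checkpoint; substitute a real one.
checkpoint = "your-org/instruction-tuned-model"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The trainable RLHF policy: just the SFT model, loaded as-is.
policy = AutoModelForCausalLM.from_pretrained(checkpoint)

# Many RLHF setups also keep a frozen copy of the same model as a
# reference for a KL penalty, so the policy cannot drift too far
# from the baseline it started from.
reference = AutoModelForCausalLM.from_pretrained(checkpoint)
reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)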
Related
A team is developing a language model designed to align with human preferences. They are following a standard four-stage process. Arrange the following stages in the correct chronological order.
A development team is using a four-stage process to align a language model with human preferences. They collect a large dataset where human annotators consistently rank verbose and evasive responses as low quality. This dataset is then used to train a reward model. Finally, the language model is fine-tuned using reinforcement learning, with the reward model providing the optimization signal. However, the final, aligned language model still frequently produces verbose and evasive outputs. Which stage is the most likely source of this failure? (A toy sketch of the reward-model-as-signal stage follows this list.)
A team is aligning a language model with human preferences using a four-stage process. Match each stage of the process to its primary function and the key artifact it produces.
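To make the "reward model as optimization signal" idea from the failure-diagnosis question concrete, here is a toy, self-contained sketch, not any team's actual setup: a two-token "policy" is updated with REINFORCE against a hand-written reward function that penalizes verbosity, standing in for a learned reward model.

import torch
import torch.nn as nn

VOCAB = ["brief", "filler"]  # toy vocabulary: informative token vs. padding

class ToyPolicy(nn.Module):
    """A minimal 'policy': unconditional logits over the toy vocabulary."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(len(VOCAB)))

    def sample(self, length=8):
        dist = torch.distributions.Categorical(logits=self.logits)
        tokens = dist.sample((length,))
        return tokens, dist.log_prob(tokens).sum()

def reward_model(tokens):
    """Stand-in for a learned reward model: verbose output scores low."""
    filler_ratio = (tokens == VOCAB.index("filler")).float().mean()
    return 1.0 - filler_ratio

policy = ToyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=0.1)

for step in range(200):
    tokens, log_prob = policy.sample()
    reward = reward_model(tokens)
    loss = -reward * log_prob  # REINFORCE: raise log-prob of high-reward samples
    opt.zero_grad()
    loss.backward()
    opt.step()

print(policy.logits.softmax(-1))  # probability mass shifts toward "brief"

If the reward model itself mis-scores verbosity, this same loop will happily optimize the wrong signal, which is the diagnostic point of the question above.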
Learn After
A development team is preparing to use a human-feedback-driven process to improve an AI's helpfulness and safety. They have two candidate models to use as their starting point:
Model A: A raw, pre-trained model that is very good at predicting the next word in a sentence but has not been specifically trained to follow user commands.
Model B: A model that has been pre-trained and then further fine-tuned on a dataset of instructions and high-quality answers, making it proficient at following user commands.
Which statement best evaluates the choice of a starting model for this alignment process?
Diagnosing an Inefficient Alignment Process
Characteristics of the Starting Model for Alignment