Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a fine-tuning method for Large Language Models, used as an alternative to purely supervised fine-tuning. It was introduced by Christiano et al. (2017) for deep reinforcement learning and later applied to language models by Stiennon et al. (2020). RLHF addresses the LLM alignment challenge by framing it as a reinforcement learning problem: human annotators compare different model outputs for the same prompt, a reward model is trained to reproduce those preferences, and the reward signal it provides is then used to optimize the LLM's policy.
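The comparison-based signal in this definition is easiest to see at the reward-modelling step. The sketch below is an illustrative assumption rather than the course's implementation: a toy PyTorch reward model (a small GRU standing in for an LLM backbone) is fitted to human "chosen vs. rejected" comparisons with the standard pairwise Bradley-Terry loss, -log σ(r_chosen - r_rejected). The scalar reward it learns is what a policy-optimization algorithm such as PPO later optimizes the LLM against.

```python
# Minimal sketch of the preference-learning step in RLHF (illustrative only).
# A reward model is trained on human comparisons: given a "chosen" and a
# "rejected" response to the same prompt, it learns to score the chosen one
# higher via the Bradley-Terry pairwise loss -log sigmoid(r_chosen - r_rejected).

import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy stand-in for an LLM backbone with a scalar reward head."""
    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> one scalar reward per sequence
        h, _ = self.encoder(self.embed(token_ids))
        return self.reward_head(h[:, -1]).squeeze(-1)  # score from last hidden state

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the human-preferred response's reward above the other's."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    model = TinyRewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Fake batch of tokenized (prompt + response) pairs; in practice these come
    # from human annotators ranking the model's own outputs for the same prompt.
    chosen = torch.randint(0, 1000, (8, 20))
    rejected = torch.randint(0, 1000, (8, 20))

    opt.zero_grad()
    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    opt.step()
    print(f"pairwise preference loss: {loss.item():.4f}")
```

In a full RLHF pipeline this reward model is frozen after training and queried to score rollouts from the policy, which is then updated (typically with PPO plus a KL penalty toward the initial model) to maximize the learned reward.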

References
Reference of Foundations of Large Language Models Course
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.4 Alignment - Foundations of Large Language Models
Related
Reinforcement Learning from Human Feedback (RLHF)
A development team is working on an AI assistant. After its initial training, they find that while the assistant's answers are factually accurate, they are often perceived as blunt or unhelpful. To address this, the team decides to use a process where human evaluators are shown a user's prompt followed by two or more different responses generated by the assistant. Which of the following tasks, given to the human evaluators, would be most effective for refining the model's helpfulness and tone?
Addressing Post-Tuning Model Flaws
An AI development team wants to improve a pre-trained model's alignment by making its responses more helpful and less likely to be harmful. Arrange the core steps of the process for incorporating human evaluations into this refinement stage.
Desired Qualities of Value-Aligned LLMs
Example of Value Alignment: Refusing Harmful Requests
Difficulty of Encoding Human Values in Datasets
Reinforcement Learning from Human Feedback (RLHF)
A user asks a large language model: "Summarize the arguments for and against using genetically modified organisms (GMOs) in agriculture." Consider two possible responses:
Model A's Response: "Genetically modified organisms are a triumph of modern science, allowing for higher crop yields and resistance to pests. They are essential for feeding the world's growing population and concerns about them are largely unscientific and based on fear."
Model B's Response: "Arguments for GMOs often highlight benefits such as increased crop yields, enhanced nutritional content, and resistance to pests and diseases, which can contribute to food security. Arguments against them frequently raise concerns about potential long-term environmental impacts, the risk of cross-pollination with non-GMO crops, and the socio-economic effects on small-scale farmers."
Which model's response better demonstrates successful alignment with human values, and why?
Evaluating an LLM's Response to a Sensitive Request
Challenge of Articulating Human Preferences for Data Annotation
A large language model that accurately and efficiently follows every user instruction without deviation is considered perfectly aligned with human values.
Role of Fine-Tuning in Value Alignment
Learn After
Historical Development of RLHF
Policy Learning in RLHF
Justification for Using RLHF over Supervised Learning
Bridging Language Modeling and Reinforcement Learning Notations in RLHF
Architectural Components of an RLHF System
Three-Stage Training Process of RLHF
Refinements and Alternatives to RLHF
Rationale for End-of-Sequence Rewards in RLHF
High-Level Process of RLHF with PPO
Limitations of Human Feedback in LLM Alignment
Computational and Stability Challenges of RLHF
Goal of RLHF
Origin and Application of RLHF
Dual Learning Tasks of RLHF: Reward and Policy Learning
Four-Stage Process of Reinforcement Learning from Human Feedback (RLHF)
RLHF Training Process with PPO
An AI development team is considering two different methods for training a conversational assistant to be more helpful and aligned with user expectations. Method 1 involves having human experts write a large dataset of ideal, high-quality responses to various prompts, and then training the AI to imitate these examples. Method 2 involves having the AI generate several responses to each prompt, and then asking human experts to simply rank these responses from best to worst. This ranking data is then used to train a separate 'preference model' that provides a reward signal to guide the AI's learning process. Which statement best analyzes the primary advantage of Method 2 over Method 1?
LLM as the Agent in RLHF
Reward Model as an Environment Proxy in RLHF
A team is using human feedback to improve a language model's ability to follow instructions safely and helpfully. Arrange the following high-level stages of this process into the correct chronological order.
RLHF Objective Function
Comparison of Objectives: Supervised Fine-Tuning vs. RLHF
Evaluating a Training Method for a High-Stakes Application
Diagnosing Instability in an RLHF + PPO Training Run
Choosing and Justifying an RLHF Objective Under Competing Product Constraints
Interpreting Conflicting RLHF Signals: Reward Model Ranking vs. PPO Updates Under KL Regularization
Root-Cause Analysis of a “Reward Hacking” Spike During RLHF with PPO
Tuning an RLHF + PPO Update When Reward Improves but Behavior Regresses
Post-Deployment Drift After RLHF: Diagnosing Reward Model and PPO/KL Interactions
Designing an RLHF Training Blueprint for a Regulated Customer-Support LLM
You’re running an RLHF fine-tuning job for an inte...
You are reviewing an RLHF training run for an inte...
Your team is running RLHF for a customer-facing LL...