Learn Before
Insufficiency of Data Fitting for Complex Value Alignment
Aligning LLMs with complex human values is not merely a data-fitting task: a limited set of human-annotated samples rarely covers the full range of desired behaviors. The core objective is to teach the model a general capability to judge which outputs are more aligned with human preferences, rather than to have it replicate a fixed set of examples.
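Below is a minimal sketch of one standard way this idea is made concrete: rather than fitting exact target outputs, a reward model is trained on pairwise comparisons with a Bradley-Terry style loss, so it learns a general "better vs. worse" scoring. The function and tensor names here are illustrative assumptions, not taken from this card.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the score of the human-preferred response
    # above that of the rejected one, teaching a general ranking
    # instead of memorizing specific target outputs.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with hypothetical scalar rewards a reward model might emit
# (values are for illustration only):
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.9, 1.1])
print(pairwise_preference_loss(r_chosen, r_rejected))
```

Because the loss depends only on the difference between scores, the model learns a relative preference ordering that can generalize beyond the annotated pairs.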
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reward Model as an Imperfect Environment Proxy
Direct Preference Optimization (DPO) Training Process
Comparison of RLHF and DPO Training Pipelines
Limitations of Human Feedback for LLM Alignment
An AI development team aims to align a large language model to be more helpful. They create a dataset where, for a given prompt, they collect two different responses from the model and have human annotators label which of the two responses is superior. What is the primary and most direct function of this specific type of dataset in a human preference alignment methodology?
A development team is refining a large language model to be more helpful and harmless. They are using a method that involves learning from human judgments about which of two responses is better. Arrange the following three core stages of this alignment process into the correct chronological order.
Comparison of AI Feedback and Human Feedback for LLM Alignment
Outcome-Based Reward Models
AI Chatbot Alignment Strategy
Learn After
A development team aims to align a large language model with the complex value of 'being helpful'. Their strategy is to create a high-quality dataset of 50,000 question-and-answer pairs where the model's response is rated as 'very helpful' by human annotators. They then fine-tune the model with the sole objective of maximizing its ability to reproduce these exact 'very helpful' answers. Which statement best evaluates the fundamental limitation of this data-fitting approach for achieving the team's goal?
Analysis of an LLM Alignment Failure
Limitations of Supervised Fine-Tuning for Value Alignment