1Cademy - Alignment as a Segment Classification Problem

Learn Before

Segment-Based Reward Computation

Concept

Alignment as a Segment Classification Problem

In certain alignment tasks, such as evaluating ethical considerations, the problem can be framed as a classification task at the segment level. Rather than receiving a continuous score, each segment of a response is assigned one of a set of discrete classes, for instance 'ethical' or 'unethical'. These labels can be assigned by human annotators or by automated classifiers.
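
As a rough illustration of segment-level labeling, the sketch below splits a response into sentence-like segments and assigns each one a discrete label. Everything in it is an assumption made for illustration: the sentence-based segmentation rule, the function names, and the keyword-based `toy_classifier`, which merely stands in for a human annotator or a trained classifier.

```python
import re
from typing import Callable, List, Tuple

# Discrete label set used for this illustration (taken from the example classes above).
LABELS = ("ethical", "unethical")


def split_into_segments(response: str) -> List[str]:
    """Split a response into sentence-like segments.

    Assumption: sentences are a reasonable segment unit; real systems may
    segment by tokens, clauses, or reasoning steps instead.
    """
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def toy_classifier(segment: str) -> str:
    """Hypothetical stand-in for a human annotator or a trained classifier."""
    flagged_terms = ("steal", "deceive")  # illustrative keywords only
    lowered = segment.lower()
    return "unethical" if any(t in lowered for t in flagged_terms) else "ethical"


def label_segments(
    response: str, classify: Callable[[str], str] = toy_classifier
) -> List[Tuple[str, str]]:
    """Return (segment, label) pairs, one discrete label per segment."""
    return [(seg, classify(seg)) for seg in split_into_segments(response)]


example = "You could ask the owner for permission. Or you could steal the key."
labeled = label_segments(example)
for segment, label in labeled:
    print(f"{label:>9}: {segment}")
```

The point of the sketch is the output shape: each segment carries a categorical label rather than a scalar score, and the `classify` argument can be swapped for any other labeling source.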

Updated 2026-05-03

References

Reference of Foundations of Large Language Models Course

Learn After

Training Reward Models with Classification Loss for Segment Alignment
A team is developing a safety filter for a language model. Their goal is to prevent the model from generating text that falls into several strictly prohibited categories (e.g., revealing private data, generating hate speech). For fine-grained feedback, they evaluate each model response by breaking it into smaller segments. Which evaluation strategy would be most effective for this specific goal, and why?
Segment Evaluation Methods
Improving Content Moderation Feedback
Notation for Ground Truth Labels in Segment Classification
