Learn Before
A team is developing a safety filter for a language model. Their goal is to prevent the model from generating text that falls into several strictly prohibited categories (e.g., revealing private data, generating hate speech). For fine-grained feedback, they evaluate each model response by breaking it into smaller segments. Which evaluation strategy would be most effective for this specific goal, and why?
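The scenario above describes segment-level evaluation against strictly prohibited categories, where each segment receives a binary label per category. The following is a minimal, hypothetical sketch of that setup: the keyword-based classifier, the category names, and all function names are illustrative stand-ins, not the course's actual method; a real system would use a trained classifier or reward model per segment.

```python
# Hypothetical sketch of segment-level safety evaluation.
# The keyword matcher below is a stand-in for a trained segment
# classifier; categories and keywords are illustrative only.

PROHIBITED = {
    "private data": ["ssn", "password"],
    "hate speech": ["<slur>"],
}

def split_into_segments(response: str) -> list[str]:
    # Simplest possible segmentation: one segment per sentence.
    return [s.strip() for s in response.split(".") if s.strip()]

def classify_segment(segment: str) -> dict[str, bool]:
    # One binary label per prohibited category for this segment.
    lowered = segment.lower()
    return {cat: any(kw in lowered for kw in kws)
            for cat, kws in PROHIBITED.items()}

def evaluate_response(response: str) -> dict:
    segments = split_into_segments(response)
    labels = [classify_segment(seg) for seg in segments]
    # Strict policy: one flagged segment fails the whole response,
    # but the per-segment labels show exactly where it failed.
    violation = any(any(lbl.values()) for lbl in labels)
    return {"segments": segments, "labels": labels, "pass": not violation}

flagged = evaluate_response("Here is my password. The weather is nice.")
clean = evaluate_response("The weather is nice. Have a good day.")
```

Because labels are attached to individual segments rather than the whole response, the evaluator can pinpoint which segment triggered which category, which is the fine-grained feedback the question is asking about.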
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Training Reward Models with Classification Loss for Segment Alignment
Segment Evaluation Methods
Improving Content Moderation Feedback
Notation for Ground Truth Labels in Segment Classification