Evaluating Model Alignment Strategies
A company has developed a large language model to act as a creative writing assistant. During testing, the team finds that the model occasionally generates content that is unoriginal or relies heavily on harmful stereotypes. The team proposes two different methods to steer the model toward safer and more appropriate outputs:
Method A: Create a comprehensive set of strict rules and filters that block the model from generating text containing specific keywords or phrases associated with stereotypes and plagiarism (a minimal code sketch of this filtering approach appears after Method B below).
Method B: Collect thousands of the model's outputs and have a diverse group of human reviewers rate each output on a scale from 'safe and original' to 'harmful or unoriginal'. Use these human ratings to further train the model to prefer generating outputs that receive high scores (a corresponding reward-model sketch also appears below).
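To make the comparison concrete, here is a minimal sketch of what Method A amounts to in code: a static blocklist checked against each output. The blocklist entries and helper name are illustrative placeholders, not drawn from any real system.

```python
# A minimal sketch of Method A: a fixed blocklist applied to every output.
# The blocklist entries and function name are illustrative placeholders.
BLOCKLIST = [
    "example stereotyped phrase",   # placeholder: a phrase the team wants to ban
    "example plagiarized passage",  # placeholder: a known copied passage
]

def passes_filter(text: str) -> bool:
    """Return True only if the output contains none of the blocked phrases."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)

# Every output is checked against the same static rules; anything the rule
# authors failed to anticipate (paraphrases, new stereotypes) passes through.
print(passes_filter("A perfectly original short story."))  # True
```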
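And a corresponding sketch of Method B: a small reward model fitted to the human ratings, which could then guide further training of the generator. This assumes PyTorch, and the embeddings, ratings, and dimensions are synthetic placeholders rather than a real pipeline.

```python
# A minimal sketch of Method B: fit a small reward model to human ratings,
# then use its scores to steer further training. Assumes PyTorch; the
# embeddings, ratings, and model size are synthetic placeholders.
import torch
import torch.nn as nn

EMBED_DIM = 16  # placeholder dimensionality for output embeddings

class RewardModel(nn.Module):
    """Maps an embedding of a model output to a scalar quality score."""
    def __init__(self) -> None:
        super().__init__()
        self.head = nn.Linear(EMBED_DIM, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.head(embedding).squeeze(-1)

# Toy data: embeddings of sampled outputs with human ratings in [0, 1],
# where 1.0 = 'safe and original' and 0.0 = 'harmful or unoriginal'.
embeddings = torch.randn(64, EMBED_DIM)
ratings = torch.rand(64)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(100):  # regress predicted scores onto the human ratings
    optimizer.zero_grad()
    loss = loss_fn(reward_model(embeddings), ratings)
    loss.backward()
    optimizer.step()

# The learned scorer generalizes beyond any fixed keyword list; in a full
# pipeline it would then guide fine-tuning of the generator (e.g. RLHF-style
# policy optimization) toward outputs that score highly.
```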
Critique both methods. Which method is more likely to be effective in the long term for aligning the model with the nuanced goal of producing safe and responsible creative content? Justify your decision by comparing the potential effectiveness, limitations, and scalability of each approach.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating Model Alignment Strategies
A technology company develops a powerful language model for public use. The company discovers that, when asked certain questions, the model occasionally generates detailed, unsafe instructions. To address this safety concern, the company decides to use a process of alignment guided by human input. Which of the following actions best exemplifies this alignment process?
Critique of Human-Guided LLM Alignment for Safety