A team is training a conversational agent to be more helpful. Their strategy involves having a human user interact with the agent. After each response from the agent, the human provides a numerical score indicating its quality. This score is immediately used as a signal to update the agent's internal strategy for generating the next response. This direct-feedback loop is repeated thousands of times. The team observes that this training process is prohibitively slow and costly. Based on the typical two-stage process for this kind of training, what is the most significant flaw in the team's approach?
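The flaw the question points at is that the standard pipeline first learns a *reward model* from a batch of human preference comparisons, and only then optimizes the policy against that cheap automatic scorer, instead of paying for a human judgment on every single update. A minimal sketch of this two-stage idea, using a toy linear reward model with made-up features (all names and the feature set are illustrative assumptions, not any real library's API):

```python
import math

def features(response: str):
    # Toy features: response length and a politeness marker (illustrative only).
    return [len(response) / 100.0, 1.0 if "please" in response else 0.0]

def train_reward_model(preference_pairs, epochs=200, lr=0.1):
    """Stage 1: fit weights w so that score(chosen) > score(rejected),
    using a logistic (Bradley-Terry style) loss on pairwise comparisons.
    Human labels are needed only for this fixed batch of pairs."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in preference_pairs:
            fc, fr = features(chosen), features(rejected)
            margin = sum(wi * (c - r) for wi, c, r in zip(w, fc, fr))
            grad_scale = 1.0 / (1.0 + math.exp(margin))  # gradient of -log sigmoid(margin)
            w = [wi + lr * grad_scale * (c - r) for wi, c, r in zip(w, fc, fr)]
    return w

def reward(w, response):
    """Stage 2 signal: the learned model scores any candidate response
    instantly, so policy optimization runs with no human in the loop."""
    return sum(wi * fi for wi, fi in zip(w, features(response)))

# A fixed, one-time batch of human preference data (chosen, rejected):
pairs = [("Could you please clarify?", "No."),
         ("Here is a detailed answer, please ask more.", "dunno")]
w = train_reward_model(pairs)
```

The team's mistake, in these terms, is running `reward` as a live human rating on every update, which makes the human the bottleneck for thousands of iterations rather than a one-time source of training data for the reward model.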
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Reward Model Learning in RLHF
A common method for aligning a language model with human preferences involves two major phases. Arrange the following descriptions of these phases in the correct chronological order.
A team is implementing a system to align a language model with human preferences. The process involves several distinct activities. Match each activity described below to the primary learning stage it belongs to.
Diagnosing Flawed AI Training