Interpreting Model Diagnostic Results
A machine learning team is evaluating their current text-generation model. For a set of prompts, the model generates its top 5 candidate responses, and a much more powerful 'oracle' model then selects the best response from those 5. The team observes that the oracle-selected response is, on average, only marginally better than the model's original top-ranked response, and that both are frequently of low quality. Based on this outcome, what category of error is the team's model most likely exhibiting, and why?
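The diagnostic described above can be sketched as a small simulation. This is a hypothetical illustration, not a real model API: `quality` stands in for an oracle quality judgment, and the score lists stand in for candidate responses.

```python
def diagnose(candidates_per_prompt, quality):
    """Compare the model's own top-ranked candidate against the oracle's
    best pick among the top-k candidates, averaged over prompts."""
    top1_scores, oracle_scores = [], []
    for candidates in candidates_per_prompt:
        scores = [quality(c) for c in candidates]
        top1_scores.append(scores[0])      # model's #1 candidate (best-first order)
        oracle_scores.append(max(scores))  # oracle selects the best of the k
    n = len(candidates_per_prompt)
    return sum(top1_scores) / n, sum(oracle_scores) / n

# Toy data: each inner list holds quality scores for a model's top-5
# responses to one prompt, in the model's own ranking order.
scores_by_prompt = [
    [0.40, 0.42, 0.38, 0.35, 0.41],  # all five candidates are weak
    [0.45, 0.43, 0.44, 0.40, 0.46],
]
top1, oracle = diagnose(scores_by_prompt, quality=lambda s: s)
```

In this toy run the oracle's average barely exceeds the model's own top-1, and both are low in absolute terms: the pattern the question describes. A small gap with low scores suggests good answers are missing from the candidate set entirely, whereas a large gap (as in the second question below) would suggest good candidates exist but are ranked poorly.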
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Language Model Performance Diagnosis
A development team is analyzing the performance of their language model. For a set of test prompts, they take the top 5 responses generated by the model and have a significantly more powerful 'oracle' model select the best response from that list. They find that the average quality score of the model's original top-ranked response is 70%, while that of the oracle-selected response is 95%. What does this large performance gap most strongly suggest about the primary limitation of the team's model?
Interpreting Model Diagnostic Results
Search Errors in LLMs