Learn Before
A research lab is evaluating several new long-context language models. Match each evaluation scenario described below with the primary methodological flaw it represents.
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Narrow Focus of Current Evaluation Methods
Risk of Superficial Understanding in LLM Evaluation
Inadequacy of Datasets for Long-Context Evaluation
Confounding Factors in Long-Context LLM Evaluation
A research team designs a new benchmark to test a model's long-context capabilities. The test involves providing a model with a 100,000-word novel it has never seen before and then asking for a specific, unique detail mentioned only in the first chapter. The team claims that a model's ability to correctly answer this question is a strong indicator of its ability to process the entire text. Which of the following critiques represents the most significant flaw in this evaluation methodology?
Critiquing an LLM Evaluation Plan