Learn Before
A researcher is designing a test to evaluate a new language model's ability to process long documents. The test involves inserting a single, unique sentence, 'The most effective shade of blue for a widget is cerulean,' into a 100,000-word document. The researcher consistently places this sentence within the first 1,000 words of the document and then asks the model, 'What is the most effective shade of blue for a widget?' The model is considered successful if it answers 'cerulean.' Which of the following statements best analyzes the primary limitation of this evaluation approach?
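A minimal sketch of the setup described in the question may make it easier to reason about. Everything here is an assumption for illustration: the helper names `build_document` and `is_success`, the filler vocabulary, and the seeded random position are hypothetical, and no real model is called. The sketch only builds the 100,000-word document with the sentence placed inside the first 1,000 words and applies the 'cerulean' success check.

```python
import random

NEEDLE = "The most effective shade of blue for a widget is cerulean."
QUESTION = "What is the most effective shade of blue for a widget?"
TOTAL_WORDS = 100_000   # document length stated in the question
MAX_POSITION = 1_000    # the researcher always inserts within the first 1,000 words


def build_document(filler_words, needle, position):
    """Return the filler text with the needle sentence inserted after `position` words."""
    if not 0 <= position <= len(filler_words):
        raise ValueError("position must fall inside the filler text")
    words = filler_words[:position] + needle.split() + filler_words[position:]
    return " ".join(words)


def is_success(answer):
    """The success criterion from the question: the answer names 'cerulean'."""
    return "cerulean" in answer.lower()


# Hypothetical filler text; a real test would use natural-language documents.
rng = random.Random(0)
vocab = ["report", "meeting", "widget", "design", "review", "schedule", "budget"]
filler = [rng.choice(vocab) for _ in range(TOTAL_WORDS - len(NEEDLE.split()))]

# Insertion point chosen the way the question describes: early in the document.
position = rng.randrange(MAX_POSITION)
document = build_document(filler, NEEDLE, position)
prompt = document + "\n\n" + QUESTION
```

The `position` argument is the only thing that controls where in the document the sentence appears, so it is the parameter to look at when analyzing what this evaluation does and does not measure.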
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Comparison Between Long-Context LLM Evaluation and Traditional Long-Range Dependency Evaluation
Need for New Benchmarks and Metrics for Long-Context LLMs
Challenges in Evaluating Long-Context LLMs
Evaluating a Chatbot's Long-Term Memory
Comparing Methodologies for Long-Context LLM Assessment
Selecting a Long-Context LLM for a Cost-Constrained Enterprise Document Assistant
Designing an Evaluation Plan for a Long-Context Compliance Copilot Under Latency and Cost Constraints
Choosing Long-Context Evaluation Evidence for a High-Volume Contract Review Feature
Diagnosing Conflicting Long-Context Evaluation Signals for an Internal Knowledge Assistant
Reconciling Long-Context Retrieval Quality with Inference Efficiency for a Meeting-Transcript Copilot
Evaluating a Long-Context LLM for Audit-Ready Evidence Retrieval Under Throughput Constraints
You are evaluating two candidate long-context LLMs...
Your team is writing an internal evaluation checkl...
You lead evaluation for an internal eDiscovery ass...
Your team is selecting an LLM for an internal "pol...