A research team is comparing two language models on a task that involves reading a 50-page story and then answering a question about a detail mentioned in the first chapter. Model A is specifically designed to handle very long texts, while Model B is a powerful general-purpose model. The team observes that Model B achieves a slightly lower (better) perplexity score across the entire 50-page text than Model A. However, Model A consistently answers the final question correctly, while Model B fails. What is the most likely reason for this discrepancy?
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Model Selection Strategy
Explaining Perplexity's Limitation in Long-Context Evaluation
You are evaluating two candidate long-context LLMs...
You lead evaluation for an internal eDiscovery ass...
Your team is writing an internal evaluation checkl...
Your team is selecting an LLM for an internal "pol...
Selecting a Long-Context LLM for a Cost-Constrained Enterprise Document Assistant
Choosing Long-Context Evaluation Evidence for a High-Volume Contract Review Feature
Designing an Evaluation Plan for a Long-Context Compliance Copilot Under Latency and Cost Constraints
Reconciling Long-Context Retrieval Quality with Inference Efficiency for a Meeting-Transcript Copilot
Evaluating a Long-Context LLM for Audit-Ready Evidence Retrieval Under Throughput Constraints
Diagnosing Conflicting Long-Context Evaluation Signals for an Internal Knowledge Assistant