A research team is comparing two language models on a task that involves reading a 50-page story and then answering a question about a detail mentioned in the first chapter. Model A is specifically designed to handle very long texts, while Model B is a powerful general-purpose model. The team observes that Model B achieves a slightly lower (better) perplexity score across the entire 50-page text than Model A. However, Model A consistently answers the final question correctly, while Model B fails. What is the most likely reason for this discrepancy?
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Model Selection Strategy
Explaining Perplexity's Limitation in Long-Context Evaluation
You are evaluating two candidate long-context LLMs...
You lead evaluation for an internal eDiscovery ass...
Your team is writing an internal evaluation checkl...
Your team is selecting an LLM for an internal "pol...
Selecting a Long-Context LLM for a Cost-Constrained Enterprise Document Assistant
Choosing Long-Context Evaluation Evidence for a High-Volume Contract Review Feature
Designing an Evaluation Plan for a Long-Context Compliance Copilot Under Latency and Cost Constraints
Reconciling Long-Context Retrieval Quality with Inference Efficiency for a Meeting-Transcript Copilot
Evaluating a Long-Context LLM for Audit-Ready Evidence Retrieval Under Throughput Constraints
Diagnosing Conflicting Long-Context Evaluation Signals for an Internal Knowledge Assistant