Selecting a Long-Context LLM for a Cost-Constrained Enterprise Document Assistant
Your company is piloting an internal “policy & incident” assistant that must read a single 120,000-token case file (emails, logs, policies) and answer questions like: “What exception clause applies to this incident, and what is the required escalation path?” Two candidate long-context LLMs are being compared:
- Model A: lower (better) perplexity on a held-out corpus of long documents; higher GPU memory use; slower time-to-first-token (TTFT) but similar tokens-per-second (TPS) once generation starts.
- Model B: slightly worse (higher) perplexity; passes a needle-in-a-haystack/passkey retrieval test at 120,000 tokens with high accuracy across many insertion positions; lower memory use; faster TTFT.
Write a recommendation memo (addressed to a joint engineering and product review board) that proposes an evaluation plan and a deployment choice. Your memo must (1) explain why perplexity can be misleading for this long-context use case, (2) specify at least two quality-focused metrics beyond perplexity that you would use to validate end-to-end usefulness on real tasks, (3) specify at least two efficiency metrics you would track, and how they affect user experience and cost at this context length, and (4) justify how you would use (and limit) needle-in-a-haystack/passkey retrieval results when deciding between the models. Conclude with a clear decision and the tradeoffs you are accepting.
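For reference on the efficiency side of the memo, a minimal sketch of how TTFT and decode-phase TPS can be measured against a streaming endpoint is shown below. `stream_completion` is a hypothetical stand-in for whichever streaming client the team actually uses; it is assumed to yield generated tokens one at a time.

```python
import time

def measure_ttft_tps(stream_completion, prompt):
    """Measure time-to-first-token (TTFT) and decode-phase tokens-per-second (TPS).

    `stream_completion` is a hypothetical stand-in for a streaming client,
    assumed to yield generated tokens one at a time. With a 120,000-token
    prompt, the wait before the first token is dominated by prefill, which
    is exactly what the user experiences as the assistant "thinking".
    """
    t_start = time.perf_counter()
    t_first = None
    n_tokens = 0
    for _token in stream_completion(prompt):
        if t_first is None:
            t_first = time.perf_counter()  # first token ends the prefill wait
        n_tokens += 1
    if t_first is None:
        raise RuntimeError("stream produced no tokens")
    ttft = t_first - t_start
    # Compute TPS over the decode phase only, so prefill cost does not distort it.
    tps = (n_tokens - 1) / (time.perf_counter() - t_first) if n_tokens > 1 else float("nan")
    return ttft, tps
```

At 120,000 tokens the prefill term also drives per-request GPU-seconds, so logging it alongside TTFT and TPS gives the board both the user-experience and the cost picture.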
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.5 Inference - Foundations of Large Language Models
Related
Comparison Between Long-Context LLM Evaluation and Traditional Long-Range Dependency Evaluation
Need for New Benchmarks and Metrics for Long-Context LLMs
Challenges in Evaluating Long-Context LLMs
A researcher is designing a test to evaluate a new language model's ability to process long documents. The test involves inserting a single, unique sentence, 'The most effective shade of blue for a widget is cerulean,' into a 100,000-word document. The researcher consistently places this sentence within the first 1,000 words of the document and then asks the model, 'What is the most effective shade of blue for a widget?' The model is considered successful if it answers 'cerulean.' Which of the following statements best analyzes the primary limitation of this evaluation approach?
Evaluating a Chatbot's Long-Term Memory
Comparing Methodologies for Long-Context LLM Assessment
Designing an Evaluation Plan for a Long-Context Compliance Copilot Under Latency and Cost Constraints
Choosing Long-Context Evaluation Evidence for a High-Volume Contract Review Feature
Diagnosing Conflicting Long-Context Evaluation Signals for an Internal Knowledge Assistant
Reconciling Long-Context Retrieval Quality with Inference Efficiency for a Meeting-Transcript Copilot
Evaluating a Long-Context LLM for Audit-Ready Evidence Retrieval Under Throughput Constraints
Request Latency
Throughput
Time to First Token (TTFT)
Inter-token Latency (ITL)
Tokens Per Second (TPS)
Resource Utilization in LLM Inference
Energy Efficiency in LLM Inference
Cost Efficiency in LLM Inference
A startup is building a real-time, interactive chatbot to help customers troubleshoot technical issues. Their engineering team evaluates two different language models, 'Model X' and 'Model Y'. The team's final report concludes that Model X is superior because its responses are consistently more accurate and helpful across a wide range of test queries. Based on this report, the company decides to deploy Model X. Which of the following statements identifies the most critical potential weakness in the team's evaluation process for this specific use case?
LLM Selection for a High-Volume Chatbot
A team is evaluating a large language model for deployment. Match each evaluation goal below to the primary category of metric it represents: 'Output Quality' or 'Efficiency'.
Accuracy-Based Metrics for LLM Evaluation
Robustness Evaluation of LLMs
Usability Evaluation of LLMs
Ethical and Fairness Metrics for LLM Evaluation
A team is developing a large language model intended to function as a creative writing partner, helping authors overcome writer's block by generating novel plot twists and imaginative character descriptions. The primary goal is to produce outputs that are inspiring, engaging, and stylistically varied. Given this primary goal, which of the following evaluation approaches should the team prioritize to best measure the model's success?
An LLM development team is conducting a comprehensive evaluation of their new model. Match each evaluation goal with the specific quality dimension it is designed to assess.
LLM Selection for a Customer Service Application
A research team is comparing two language models on a task that involves reading a 50-page story and then answering a question about a detail mentioned in the first chapter. Model A is specifically designed to handle very long texts, while Model B is a powerful general-purpose model. The team observes that Model B achieves a slightly lower (better) perplexity score across the entire 50-page text than Model A. However, Model A consistently answers the final question correctly, while Model B fails. What is the most likely reason for this discrepancy?
Evaluating a Model Selection Strategy
Explaining Perplexity's Limitation in Long-Context Evaluation
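For reference on the two perplexity items above: perplexity is the exponentiated average per-token negative log-likelihood,

PPL(x_1, ..., x_N) = exp(-(1/N) * sum_{i=1}^{N} log p(x_i | x_{<i})),

so it rewards good local next-token prediction everywhere in the document and can stay low even when the model never uses information from 100,000 tokens earlier; that averaging is the core of the limitation the memo's first requirement asks about.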
An AI research team is testing a new large language model's long-context capabilities. They create a test where a unique, non-obvious fact ('The most common color for a fire hydrant in Iceland is bright yellow') is inserted into different locations within a very long, unrelated document. The model is then prompted to retrieve this specific fact. The team observes that the model successfully retrieves the fact when it's placed near the beginning or the end of the document, but consistently fails to retrieve it when it's placed in the middle sections. What does this experimental result most strongly suggest about the model's performance?
Critique of a Synthetic Retrieval Task
Designing a Long-Context Retrieval Experiment
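The two experiment-design items above can be made concrete with a small harness. Below is a minimal sketch of a passkey sweep that varies insertion depth instead of fixing it near the start; `ask_model` is a hypothetical callable standing in for the model under test, and the needle/question strings are illustrative.

```python
import random

NEEDLE = "The passkey is {key}. Remember it."
QUESTION = "What is the passkey mentioned in the document?"

def build_case(filler_sentences, depth, rng):
    """Insert the needle at a controlled relative depth in filler text.

    depth=0.0 places it at the start and depth=1.0 at the end; sweeping
    the full range avoids the first-1,000-words bias criticized above.
    """
    key = f"{rng.randrange(100_000):05d}"
    pos = int(depth * len(filler_sentences))
    parts = filler_sentences[:pos] + [NEEDLE.format(key=key)] + filler_sentences[pos:]
    return " ".join(parts), key

def depth_sweep(ask_model, filler_sentences,
                depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=20):
    """ask_model(document, question) -> answer text (hypothetical client)."""
    rng = random.Random(0)
    accuracy = {}
    for d in depths:
        hits = 0
        for _ in range(trials):
            doc, key = build_case(filler_sentences, d, rng)
            hits += key in ask_model(doc, QUESTION)  # exact-match retrieval check
        accuracy[d] = hits / trials  # a mid-depth dip is the "lost in the middle" signature
    return accuracy
```

Even a clean sweep only demonstrates verbatim copy retrieval, so results like these should gate a model out when it fails, not qualify it in; task-level evidence on real case files is still required.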