Case Study

Diagnosing Conflicting Long-Context Evaluation Signals for an Internal Knowledge Assistant

You are evaluating two candidate long-context LLMs (Model A and Model B) for an internal knowledge assistant that must answer questions using a single, very long context (up to 120k tokens) containing policies, incident postmortems, and runbooks. The assistant will be used in a live chat UI where users expect the first token quickly and the full answer within a few seconds.

Your team ran three evaluations:

  1. Perplexity on a held-out corpus of long internal documents (lower is better): Model A = 9.8, Model B = 8.9.
  2. A synthetic “needle-in-a-haystack/passkey retrieval” test: a unique passkey string is inserted once into the long context at a random position; the model must return it exactly. Success rate: Model A = 92%, Model B = 55%.
  3. Inference efficiency on your target hardware with 120k-token prompts and ~200-token answers:
    • Time to First Token (TTFT): Model A = 2.4s, Model B = 0.9s
    • Inter-token latency (ITL): Model A = 35 ms/token, Model B = 55 ms/token
    • Peak GPU memory: Model A = 72 GB, Model B = 48 GB
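To make the TTFT-vs-ITL trade-off concrete, a back-of-the-envelope latency model helps (a sketch only, assuming total chat latency ≈ TTFT + ITL × output tokens, with the ~200-token answers above):

```python
def end_to_end_latency(ttft_s, itl_ms, output_tokens=200):
    """Rough chat-latency model: time to first token, plus
    inter-token latency for each generated token after it."""
    return ttft_s + (itl_ms / 1000.0) * output_tokens

# Numbers from the efficiency evaluation above.
model_a = end_to_end_latency(ttft_s=2.4, itl_ms=35)  # ≈ 9.4 s
model_b = end_to_end_latency(ttft_s=0.9, itl_ms=55)  # ≈ 11.9 s
```

Note the reversal: Model B feels more responsive (first token in 0.9 s), but Model A finishes the full 200-token answer roughly 2.5 s sooner, which matters for the "full answer within a few seconds" requirement.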

A product manager argues to pick Model B because it has better perplexity and much faster TTFT; a staff engineer argues to pick Model A because it “actually uses the whole context.”

As the evaluation owner, write a recommendation (pick A, pick B, or propose a gated/conditional deployment) that reconciles these results. Your answer must (a) explain why perplexity can be misleading for long-context capability in this setting, (b) use the needle/passkey results to interpret long-context retrieval behavior, and (c) explicitly weigh at least two efficiency metrics (e.g., TTFT vs ITL vs memory) against output-quality needs for the live chat use case.
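For reference, evaluation 2 can be reproduced with a harness along these lines (a minimal sketch: the helper names and the regex-based `oracle_model` stand-in are illustrative, not the team's actual harness; a real run would call Model A or Model B where the stub is):

```python
import random
import re

def build_haystack(n_sentences, passkey, position):
    """Long filler context with one passkey sentence inserted at `position`."""
    filler = "The policy document describes routine operational procedures. "
    sentences = [filler] * n_sentences
    sentences.insert(position, f"The passkey is {passkey}. ")
    return "".join(sentences)

def run_passkey_eval(model_fn, trials=20, n_sentences=500, seed=0):
    """Fraction of trials where the model returns the passkey exactly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        passkey = str(rng.randint(10000, 99999))
        position = rng.randrange(n_sentences)  # random insertion depth
        prompt = (build_haystack(n_sentences, passkey, position)
                  + "\nWhat is the passkey? Answer with the number only.")
        if model_fn(prompt).strip() == passkey:
            hits += 1
    return hits / trials

# Stand-in "model" that answers by pattern search, to exercise the harness.
def oracle_model(prompt):
    m = re.search(r"passkey is (\d+)", prompt)
    return m.group(1) if m else ""
```

Varying the insertion depth (`position`) per trial is what distinguishes this from a fixed-position retrieval test: a model can pass at one depth and fail at another, so the 92% vs 55% gap should be read as an average over positions.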

Updated 2026-02-06
