Case Study

Reconciling Long-Context Retrieval Quality with Inference Efficiency for a Meeting-Transcript Copilot

You are leading model evaluation for an internal “meeting-transcript copilot” that ingests a single 120,000-token transcript (with many repeated names, agenda items, and side conversations) and must answer questions like: “What exact exception did Legal approve in the first 10 minutes?” The product requirements are p95 time-to-first-token (TTFT) ≤ 1.2s and p95 end-to-end latency ≤ 6s at 50 concurrent users. Two candidate long-context LLMs are tested.

Results (same hardware, same serving stack):

  • Model A: Lower perplexity on a held-out set of long transcripts; Needle-in-a-haystack/passkey retrieval accuracy = 62% when the passkey is placed uniformly at random across the transcript; TTFT p95 = 0.9s; inter-token latency (ITL) p95 = 35 ms/token.
  • Model B: Higher (worse) perplexity on the same held-out transcripts; Needle-in-a-haystack/passkey retrieval accuracy = 91% (uniform placement); TTFT p95 = 1.6s; ITL p95 = 22 ms/token.
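Taken together, these figures imply a concrete latency budget. The sketch below combines each model's measured TTFT and ITL under a simplifying assumption (mine, not the source's): p95 end-to-end latency ≈ p95 TTFT + output-token count × p95 ITL, ignoring queueing effects at 50-way concurrency and correlation between tail events.

```python
# Rough p95 latency-budget arithmetic for the two candidates, using the
# measured figures above. Simplifying assumption: p95 e2e ~= p95 TTFT +
# n_output_tokens * p95 ITL (ignores queueing at 50-way concurrency).
TTFT_SLO_MS = 1200   # p95 time-to-first-token SLO
E2E_SLO_MS = 6000    # p95 end-to-end latency SLO

candidates = {
    "Model A": {"ttft_ms": 900, "itl_ms": 35},
    "Model B": {"ttft_ms": 1600, "itl_ms": 22},
}

for name, m in candidates.items():
    meets_ttft = m["ttft_ms"] <= TTFT_SLO_MS
    # Largest answer length (in output tokens) that still fits the e2e budget.
    max_tokens = (E2E_SLO_MS - m["ttft_ms"]) // m["itl_ms"]
    print(f"{name}: TTFT SLO {'PASS' if meets_ttft else 'FAIL'}; "
          f"~{max_tokens} output tokens fit in the 6s e2e budget")

# Model A: TTFT SLO PASS; ~145 output tokens fit in the 6s e2e budget
# Model B: TTFT SLO FAIL; ~200 output tokens fit in the 6s e2e budget
```

Under that simplification, Model B misses the 1.2s TTFT SLO as measured, while Model A's faster first token comes with a tighter per-answer generation budget; which constraint binds depends on typical answer length.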

A senior stakeholder argues: “Perplexity is the most objective metric, and Model A is faster to first token, so we should ship Model A.”
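Before accepting that argument, it helps to see how little a localized long-context failure can move perplexity, since the metric averages log-loss over every token. The numbers below are invented purely for illustration (avg_nll, blown_nll, and the 5-token critical span are all hypothetical):

```python
import math

# Toy arithmetic with hypothetical losses: perplexity averages negative
# log-likelihood over ALL tokens, so a few retrieval-critical tokens
# barely move it even when the model effectively "loses" them.
N = 120_000          # transcript length in tokens
avg_nll = 2.0        # assumed mean NLL on ordinary transcript filler (nats)
critical = 5         # tokens encoding the fact the user will ask about
blown_nll = 15.0     # assumed NLL when those tokens are modeled badly

ppl_good = math.exp(avg_nll)
ppl_bad = math.exp((avg_nll * (N - critical) + blown_nll * critical) / N)
print(f"PPL, critical tokens modeled well:  {ppl_good:.4f}")  # ~7.3891
print(f"PPL, critical tokens modeled badly: {ppl_bad:.4f}")   # ~7.3931
```

A perplexity gap of under 0.1% is invisible on a held-out set, yet it corresponds exactly to the failure the copilot's users would notice.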

As the evaluator, what decision do you recommend: Model A, Model B, or ‘neither: run additional evaluation first’? Justify your recommendation by explicitly connecting (1) why perplexity can be misleading as a measure of long-context capability, (2) what the needle-in-a-haystack/passkey results imply for this use case, and (3) how you would weigh TTFT vs. ITL vs. end-to-end latency and throughput against the stated SLOs. Provide a concise, defensible rationale that a product and engineering audience could act on.
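For reference, the needle-in-a-haystack/passkey protocol cited in the results can be sketched as below. This is a minimal uniform-placement variant under my own assumptions; make_haystack, passkey_accuracy, and model_answer_fn are hypothetical names, not the harness that produced the 62% and 91% figures.

```python
import random

def make_haystack(filler_tokens: list[str], passkey: str, depth: float) -> str:
    """Insert a passkey sentence at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(depth * len(filler_tokens))
    needle = f" The passkey is {passkey}. "
    return " ".join(filler_tokens[:pos]) + needle + " ".join(filler_tokens[pos:])

def passkey_accuracy(model_answer_fn, filler_tokens, n_trials: int = 100) -> float:
    """Retrieval accuracy with the passkey placed uniformly at random."""
    hits = 0
    for _ in range(n_trials):
        passkey = str(random.randint(10_000, 99_999))
        depth = random.random()  # uniform placement across the context
        prompt = (make_haystack(filler_tokens, passkey, depth)
                  + "\nWhat is the passkey? Answer with the number only.")
        if passkey in model_answer_fn(prompt):
            hits += 1
    return hits / n_trials

# Usage sketch: model_answer_fn stands in for a call to a candidate model.
# acc = passkey_accuracy(model_answer_fn, transcript_tokens, n_trials=500)
```

A per-depth breakdown of the same harness (accuracy bucketed by placement depth) is what distinguishes a ‘lost in the middle’ failure from uniform degradation.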
