Essay

Selecting a Long-Context LLM for a Cost-Constrained Enterprise Document Assistant

Your company is piloting an internal “policy & incident” assistant that must read a single 120,000-token case file (emails, logs, policies) and answer questions like: “What exception clause applies to this incident, and what is the required escalation path?” Two candidate long-context LLMs are being compared:

  • Model A: lower perplexity on a held-out corpus of long documents; higher GPU memory use; slower time-to-first-token (TTFT) but similar tokens-per-second (TPS) once generation starts.
  • Model B: slightly worse perplexity; passes a needle-in-a-haystack/passkey retrieval test at 120,000 tokens with high accuracy across many insertion positions; lower memory use; faster TTFT.
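To make the TTFT/TPS comparison concrete, a minimal timing harness could look like the sketch below. Everything here is hypothetical: `fake_stream` stands in for a real model's streaming API, and its delays only simulate a prefill phase followed by per-token generation.

```python
import time

def measure_latency(stream):
    """Measure TTFT and steady-state TPS from a token stream.

    `stream` is any iterable yielding generated tokens; here it is a
    stand-in for a candidate model's streaming API.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time-to-first-token: prefill dominates this
        count += 1
    total = time.perf_counter() - start
    # tokens-per-second over the generation phase (after the first token)
    tps = (count - 1) / (total - ttft) if count > 1 and total > ttft else 0.0
    return ttft, tps

def fake_stream(n_tokens=50, prefill_s=0.02, per_token_s=0.001):
    """Simulated model: a prefill delay, then a fixed delay per token."""
    time.sleep(prefill_s)
    for _ in range(n_tokens):
        time.sleep(per_token_s)
        yield "tok"

ttft, tps = measure_latency(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, TPS: {tps:.0f}")
```

At a 120,000-token context, prefill is the dominant cost, which is why the two models can have similar TPS yet very different TTFT.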

Write a recommendation memo (as if to an engineering + product review board) that proposes an evaluation plan and a deployment choice. Your memo must (1) explain why perplexity can be misleading for this long-context use case, (2) specify at least two quality-focused metrics beyond perplexity that you would use to validate end-to-end usefulness on real tasks, (3) specify at least two efficiency metrics you would track and how they affect user experience and cost at this context length, and (4) justify how you would use (and limit) needle-in-a-haystack/passkey retrieval results when deciding between the models. Conclude with a clear decision and the tradeoffs you are accepting.
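The passkey-retrieval check referenced in requirement (4) can be sketched as a simple harness that sweeps insertion depths. This is an illustrative sketch, not a standard benchmark implementation: `ask` is a hypothetical callable standing in for a model API, and `toy_ask` is a trivial stand-in so the example runs on its own.

```python
import random

def build_haystack(n_sentences, passkey, depth):
    """Filler document with a passkey sentence inserted at a relative depth in [0, 1]."""
    filler = ["The quick brown fox jumps over the lazy dog."] * n_sentences
    needle = f"The passkey is {passkey}. Remember it."
    pos = int(depth * len(filler))
    return " ".join(filler[:pos] + [needle] + filler[pos:])

def run_passkey_eval(ask, depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=3):
    """Score a model callable `ask(context, question) -> str` at each insertion depth."""
    results = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            passkey = str(random.randint(10000, 99999))
            context = build_haystack(200, passkey, depth)
            answer = ask(context, "What is the passkey?")
            hits += passkey in answer
        results[depth] = hits / trials
    return results

# Stand-in "model" that just scans the context for a number; a real
# harness would call the candidate LLM's API here instead.
def toy_ask(context, question):
    for word in context.split():
        if word.strip(".").isdigit():
            return word.strip(".")
    return "unknown"

print(run_passkey_eval(toy_ask))
```

Note what this harness does and does not show: a model can retrieve a verbatim passkey at every depth yet still fail the multi-step reasoning the case-file questions require, which is the limitation requirement (4) asks the memo to address.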

Updated 2026-02-06
