Essay

Designing an Evaluation Plan for a Long-Context Compliance Copilot Under Latency and Cost Constraints

Your company is piloting a long-context “compliance copilot” that ingests a full 200-page policy manual plus a 6-month email thread (often 80k–120k tokens total) and must answer auditors’ questions by citing the exact sentence(s) that justify the answer. The product requirement is: (1) correct retrieval of the relevant clause even if it appears only once in the middle of the context, (2) end-to-end request latency under 8 seconds for 95% of requests, and (3) predictable cloud spend per 1,000 requests.
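Before comparing models, it helps to make the latency and cost requirements concrete. The sketch below shows how TTFT (time to first token) and tokens-per-second combine into end-to-end latency, and how the long input context dominates per-request cost. All prices, token counts, and speeds here are illustrative assumptions, not measurements of any real model.

```python
# Back-of-envelope check of the 8 s p95 budget and cost per 1,000 requests.
# Every number below is an assumed placeholder, not a benchmark result.

def e2e_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """End-to-end latency = time to first token + decode time for the answer."""
    return ttft_s + output_tokens / tokens_per_s

def cost_per_1k_requests(input_tokens: int, output_tokens: int,
                         usd_per_1m_in: float, usd_per_1m_out: float) -> float:
    """Cloud spend per 1,000 requests at assumed per-million-token prices."""
    per_request = (input_tokens * usd_per_1m_in + output_tokens * usd_per_1m_out) / 1e6
    return 1000 * per_request

# Assumed workload: 100k-token context, 300-token cited answer.
lat_a = e2e_latency_s(ttft_s=4.0, output_tokens=300, tokens_per_s=40)  # slower-TTFT model
lat_b = e2e_latency_s(ttft_s=2.0, output_tokens=300, tokens_per_s=80)  # faster-TTFT model
print(lat_a, lat_b)  # 11.5 s vs 5.75 s: only the second fits the 8 s budget here
print(cost_per_1k_requests(100_000, 300, 3.0, 15.0))  # about 304.5 USD per 1k requests
```

The point of the sketch is that at 80k–120k input tokens, prefill time (reflected in TTFT) and input-token pricing, not answer length, tend to dominate both latency and spend, which is why the plan must measure them separately.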

Two candidate models are proposed:

  • Model A: lower perplexity on a held-out corpus of similar manuals; slower TTFT (time to first token) and lower tokens-per-second.
  • Model B: slightly worse perplexity; faster TTFT and higher tokens-per-second.
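The two speed figures quoted above can be derived directly from token arrival timestamps when the model streams its output. A minimal sketch, assuming you have recorded the request time and per-token timestamps (the example values are made up):

```python
# Deriving TTFT and decode tokens-per-second from streaming timestamps.
# Timestamps are assumed inputs (e.g., time.monotonic() at each token event).

def ttft(request_t: float, token_times: list[float]) -> float:
    """Time to first token: first token timestamp minus request timestamp."""
    return token_times[0] - request_t

def tokens_per_second(token_times: list[float]) -> float:
    """Decode throughput over the generation, excluding prefill (TTFT)."""
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])

# Example: request at t=0.0, four tokens arriving 50 ms apart after a 2 s prefill.
request_time = 0.0
token_times = [2.0, 2.05, 2.10, 2.15]
print(ttft(request_time, token_times))        # 2.0 s
print(tokens_per_second(token_times))         # about 20 tokens/s
```

Separating the two matters for this workload: with an 80k–120k token context, TTFT is dominated by prefill and is where A and B are said to differ most.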

Write an evaluation plan that would let you recommend A or B for this pilot. Your plan must (a) explain why perplexity alone can be misleading for this long-context use case, (b) specify at least one needle-in-a-haystack/passkey-style retrieval experiment you would run (including how you would vary the needle position and what constitutes success), (c) define the quality-focused metrics you would use for the auditor Q&A task (beyond perplexity) and how they relate to the retrieval experiment, and (d) define the efficiency metrics you would measure (e.g., TTFT, inter-token latency/tokens-per-second, throughput, resource utilization, energy/cost) and how you would trade them off against quality to make a final recommendation. Be explicit about what outcomes would cause you to choose A vs. B.
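One way to make the retrieval experiment in part (b) concrete is a passkey-style probe: plant a single policy sentence (the "needle") at varying relative depths in synthetic filler text and check whether the model's answer recovers the exact fact. Everything below, including the needle sentence, the filler generator, and the scoring rule, is an illustrative sketch rather than a specific published benchmark.

```python
# Passkey-style needle-in-a-haystack probe. All strings are hypothetical.

NEEDLE = "The retention period for vendor contracts is 11 years."
QUESTION = "How many years must vendor contracts be retained?"
EXPECTED = "11 years"

def build_context(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(depth * len(filler_sentences))
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

def success(answer: str, expected: str) -> bool:
    """Success criterion: the exact fact appears verbatim in the answer.
    A stricter variant would also require the cited sentence to match NEEDLE."""
    return expected.lower() in answer.lower()

def run_probe(model_answer_fn, n_fillers: int = 2000,
              depths=(0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0)) -> dict:
    """Run the probe at each depth; model_answer_fn maps a prompt to an answer."""
    fillers = [f"Policy section {i} covers routine procedure {i}." for i in range(n_fillers)]
    results = {}
    for d in depths:
        ctx = build_context(fillers, NEEDLE, d)
        prompt = f"{ctx}\n\nQuestion: {QUESTION}\nCite the exact sentence."
        results[d] = success(model_answer_fn(prompt), EXPECTED)
    return results
```

In a real run, `model_answer_fn` would call Model A or B; plotting success per depth (and per total context length) exposes the "lost in the middle" failure mode that held-out perplexity cannot reveal, and ties directly to the citation-accuracy metric in part (c).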

Updated 2026-02-06
