Case Study

Evaluating a Long-Context LLM for Audit-Ready Evidence Retrieval Under Throughput Constraints

You lead evaluation for an internal “Audit Evidence Finder” that ingests a single 120,000-token bundle (policies, emails, and change logs) and must answer two questions: (1) the exact approval code for a specific change request, and (2) the sentence that states the effective date of the related policy. In production, the tool must handle 2,000 such bundles per day with an SLO of time-to-first-token (TTFT) ≤ 1.5 s and an average cost ≤ $0.03 per bundle. Two candidate models are tested:
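
Before any quality testing, the efficiency SLOs already imply hard numeric ceilings. A back-of-the-envelope check, using only the figures above (the assumption that per-bundle cost is dominated by the 120k input tokens is mine, since the answers themselves are only a few dozen output tokens):

    # Back-of-the-envelope check on the efficiency SLOs.
    # Assumption: per-bundle cost is dominated by input tokens.
    BUNDLES_PER_DAY = 2_000
    TOKENS_PER_BUNDLE = 120_000
    COST_CAP_PER_BUNDLE = 0.03  # USD

    daily_input_tokens = BUNDLES_PER_DAY * TOKENS_PER_BUNDLE    # 240,000,000
    daily_cost_cap = BUNDLES_PER_DAY * COST_CAP_PER_BUNDLE      # $60.00/day
    # Implied ceiling on input-token pricing:
    max_price_per_mtok = COST_CAP_PER_BUNDLE / (TOKENS_PER_BUNDLE / 1_000_000)

    print(f"{daily_input_tokens:,} input tokens/day")
    print(f"daily budget ${daily_cost_cap:.2f}")
    print(f"input price ceiling ${max_price_per_mtok:.2f} per 1M tokens")  # $0.25

Any candidate priced above roughly $0.25 per million input tokens fails the cost SLO regardless of accuracy, so this check can disqualify a model before any retrieval test is run.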

  • Model A: Lower perplexity on a held-out corpus of similar documents; strong fluency in summaries. In a pilot, it often returns plausible-looking approval codes that turn out to be wrong when the true code sits deep in the middle of the bundle.
  • Model B: Slightly worse perplexity; in synthetic tests it reliably retrieves a hidden passkey placed at random positions in long contexts (a minimal version of such a test is sketched after this list), but it has higher TTFT and lower decode throughput (tokens/sec).
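
The passkey test mentioned for Model B is simple to reproduce. Below is a minimal sketch of such a needle-in-a-haystack harness; query_model is a hypothetical stand-in for your model client, and the filler text, depth grid, word-per-token ratio, and trial count are illustrative assumptions:

    import random

    def make_passkey_prompt(passkey: str, total_tokens: int, depth: float,
                            filler: str = "The quick brown fox jumps over the lazy dog. ") -> str:
        """Bury a passkey sentence at a relative depth (0.0 = start, 1.0 = end)
        inside roughly total_tokens of filler (~0.75 words per token)."""
        needle = f"The audit passkey is {passkey}."
        n_words = int(total_tokens * 0.75)
        base = filler.split()
        words = (base * (n_words // len(base) + 1))[:n_words]
        words.insert(int(len(words) * depth), needle)
        return " ".join(words) + "\n\nWhat is the audit passkey? Answer with the passkey only."

    def run_passkey_eval(query_model, total_tokens: int = 120_000,
                         depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials: int = 20):
        """Exact-match retrieval accuracy per depth bucket."""
        results = {}
        for depth in depths:
            hits = 0
            for _ in range(trials):
                passkey = f"{random.randint(0, 99999):05d}"
                answer = query_model(make_passkey_prompt(passkey, total_tokens, depth))
                hits += passkey in answer  # hypothetical client returns a string
            results[depth] = hits / trials
        return results

Accuracy that holds at the edges but drops in the 0.25–0.75 depth buckets is exactly the lost-in-the-middle failure Model A shows in the pilot, and it is invisible to perplexity, which averages over all token positions.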

As the evaluator, decide which model you would recommend for launch and justify your decision with a minimal evaluation package (no more than four metrics/tests in total) that (a) directly measures long-context retrieval reliability beyond perplexity, (b) captures output quality relevant to auditability, and (c) quantifies whether the model can meet the efficiency SLOs. Explain how the chosen tests/metrics interact, i.e., what tradeoffs they reveal and why perplexity alone is insufficient here.
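
To make criterion (c) concrete, TTFT and decode throughput can be measured directly under production-shaped load. The sketch below assumes a hypothetical streaming client (stream_model) that yields one token per iteration; comparing the 95th percentile of TTFT against the 1.5 s SLO is an assumption, since the case does not state how the SLO is aggregated:

    import time
    import statistics

    def measure_latency(stream_model, prompt: str):
        """Return (ttft_seconds, decode_tokens_per_sec) for one request."""
        start = time.perf_counter()
        ttft, n_tokens = None, 0
        for _token in stream_model(prompt):  # hypothetical streaming client
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            n_tokens += 1
        decode_time = (time.perf_counter() - start) - (ttft or 0.0)
        decode_tokens = max(n_tokens - 1, 0)  # throughput excludes the first token
        tps = decode_tokens / decode_time if decode_time > 0 else 0.0
        return ttft, tps

    def check_slo(stream_model, prompts, ttft_slo: float = 1.5):
        """Compare p95 TTFT over a batch of real 120k-token bundles to the SLO."""
        ttfts, tpss = zip(*(measure_latency(stream_model, p) for p in prompts))
        p95_ttft = statistics.quantiles(ttfts, n=20)[-1]  # 95th percentile
        return {"p95_ttft": p95_ttft,
                "meets_ttft_slo": p95_ttft <= ttft_slo,
                "median_tps": statistics.median(tpss)}

Run alongside the passkey harness, this makes the tradeoff the prompt asks about explicit: retrieval reliability at depth versus TTFT/throughput headroom at realistic context lengths.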
