Essay

Choosing Long-Context Evaluation Evidence for a High-Volume Contract Review Feature

Your company is piloting a long-context LLM feature that ingests a full 200-page customer contract plus a 30-page internal policy manual and then answers questions like: “Does this contract violate our policy on auto-renewal notice periods? Quote the exact clause and explain the mismatch.” Two candidate models are being compared.

Model A has better (lower) perplexity on a held-out corpus of long contracts. Model B has slightly worse perplexity but performs better on a synthetic “needle-in-a-haystack/passkey retrieval” test where a unique policy sentence is hidden at varying depths and the model must retrieve it exactly.
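For reference when comparing the two models: perplexity is simply the exponentiated average negative log-likelihood per token, i.e. it measures next-token prediction quality on average, not whether any specific fact can be retrieved from deep in the context. A minimal sketch (the token log-probabilities here are illustrative, not from either model):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    `token_logprobs` are natural-log probabilities the model assigned
    to each observed token; values closer to 0 mean better prediction.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# More confident predictions (less negative logprobs) -> lower perplexity
print(perplexity([-0.5, -1.0, -0.25]))  # ≈ 1.79
print(perplexity([-2.0, -2.5, -1.5]))  # ≈ 7.39
```

Because the metric averages over every token in a long document, a model can earn low perplexity by modeling boilerplate contract language well while still failing to locate or quote one decisive clause.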

Write an evaluation argument recommending which evidence you would trust more for a go/no-go decision, and what additional measurements you would add before selecting a model. Your answer must explicitly:

  • Explain why perplexity can be misleading for long-context capability in this setting, and what it is (and is not) measuring.
  • Explain what needle-in-a-haystack/passkey retrieval does and does not validate about real contract+policy reasoning.
  • Propose a combined evaluation plan that balances quality-focused metrics (e.g., correctness of cited clauses, robustness to distractors) with efficiency metrics (e.g., TTFT, inter-token latency/throughput, and cost/resource use) for a high-volume enterprise rollout.
  • Describe at least one concrete tradeoff you would be willing to make (or not make) between output quality and efficiency, and justify it for this product scenario.
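The efficiency metrics named above (TTFT, inter-token latency, throughput) can be captured with a thin wrapper around any streaming client. The sketch below is illustrative only; `fake_stream` stands in for a real model's streaming API, and all names are assumptions:

```python
import time

def measure_stream(stream):
    """Time a token-streaming iterable: TTFT, mean inter-token latency,
    and end-to-end tokens/second. `stream` yields tokens as produced."""
    start = time.perf_counter()
    arrival_times = []
    for _ in stream:
        arrival_times.append(time.perf_counter())
    n = len(arrival_times)
    ttft = arrival_times[0] - start if n else None
    itl = (arrival_times[-1] - arrival_times[0]) / (n - 1) if n > 1 else 0.0
    tps = n / (arrival_times[-1] - start) if n else 0.0
    return {"ttft_s": ttft, "inter_token_s": itl, "tokens_per_s": tps}

def fake_stream(n=50):
    """Simulated model: first token after ~200 ms, then ~20 ms per token."""
    time.sleep(0.2)
    for _ in range(n):
        yield "tok"
        time.sleep(0.02)
```

Running the same harness against both candidate models on realistic contract+policy prompts (not toy inputs) lets you weigh, say, Model B's retrieval accuracy against any extra latency or cost it incurs at rollout volume.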

Updated 2026-02-06
