Essay

Choosing Long-Context Evaluation Evidence for a High-Volume Contract Review Feature

Your company is piloting a long-context LLM feature that ingests a full 200-page customer contract plus a 30-page internal policy manual and then answers questions like: “Does this contract violate our policy on auto-renewal notice periods? Quote the exact clause and explain the mismatch.” Two candidate models are being compared.

Model A has better (lower) perplexity on a held-out corpus of long contracts. Model B has slightly worse perplexity but performs better on a synthetic “needle-in-a-haystack/passkey retrieval” test where a unique policy sentence is hidden at varying depths and the model must retrieve it exactly.
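For reference when comparing the two models: perplexity is simply the exponentiated average negative log-likelihood per token, i.e. it measures next-token prediction quality on average, not whether any specific fact can be retrieved from deep in the context. A minimal sketch (the token log-probabilities here are illustrative, not from either model):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    `token_logprobs` are natural-log probabilities the model assigned
    to each observed token; values closer to 0 mean better prediction.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# More confident predictions (less negative logprobs) -> lower perplexity
print(perplexity([-0.5, -1.0, -0.25]))  # ≈ 1.79
print(perplexity([-2.0, -2.5, -1.5]))  # ≈ 7.39
```

Because the metric averages over every token in a long document, a model can earn low perplexity by modeling boilerplate contract language well while still failing to locate or quote one decisive clause.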

Write an evaluation argument recommending which evidence you would trust more for a go/no-go decision, and what additional measurements you would add before selecting a model. Your answer must explicitly:

  • Explain why perplexity can be misleading for long-context capability in this setting, and what it is (and is not) measuring.
  • Explain what needle-in-a-haystack/passkey retrieval does and does not validate about real contract+policy reasoning.
  • Propose a combined evaluation plan that balances quality-focused metrics (e.g., correctness of cited clauses, robustness to distractors) with efficiency metrics (e.g., TTFT, inter-token latency/throughput, and cost/resource use) for a high-volume enterprise rollout.
  • Describe at least one concrete tradeoff you would be willing to make (or not make) between output quality and efficiency, and justify it for this product scenario.
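The efficiency metrics named above (TTFT, inter-token latency, throughput) can be captured with a thin wrapper around any streaming client. The sketch below is illustrative only; `fake_stream` stands in for a real model's streaming API, and all names are assumptions:

```python
import time

def measure_stream(stream):
    """Time a token-streaming iterable: TTFT, mean inter-token latency,
    and end-to-end tokens/second. `stream` yields tokens as produced."""
    start = time.perf_counter()
    arrival_times = []
    for _ in stream:
        arrival_times.append(time.perf_counter())
    n = len(arrival_times)
    ttft = arrival_times[0] - start if n else None
    itl = (arrival_times[-1] - arrival_times[0]) / (n - 1) if n > 1 else 0.0
    tps = n / (arrival_times[-1] - start) if n else 0.0
    return {"ttft_s": ttft, "inter_token_s": itl, "tokens_per_s": tps}

def fake_stream(n=50):
    """Simulated model: first token after ~200 ms, then ~20 ms per token."""
    time.sleep(0.2)
    for _ in range(n):
        yield "tok"
        time.sleep(0.02)
```

Running the same harness against both candidate models on realistic contract+policy prompts (not toy inputs) lets you weigh, say, Model B's retrieval accuracy against any extra latency or cost it incurs at rollout volume.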

Updated 2026-02-06
