Case Study

Selecting and Justifying a Long-Context Memory Design for a Regulated Audit Assistant

You are deploying an internal LLM assistant that helps compliance analysts answer questions about a single, very long audit package (hundreds of pages) as they work through it over several hours. Analysts frequently ask questions that require (a) the exact wording of the last 1–2 pages they just read and (b), less often, a definition or exception that appeared much earlier (e.g., 80 pages back). The system must run on a fixed GPU budget, so you cannot keep an ever-growing full KV cache for the entire document. You are considering two designs:

Design A: A fixed-size sliding-window attention cache that stores only the most recent N tokens (local attention).
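To make Design A concrete, the following is a minimal PyTorch-style sketch of a fixed-size local cache. The class name SlidingWindowKVCache, the tensor shapes, and the per-segment append/context interface are illustrative assumptions, not part of the case description.

```python
import torch

class SlidingWindowKVCache:
    """Design A (sketch): keep keys/values for only the most recent `window` tokens."""

    def __init__(self, window: int):
        self.window = window
        self.keys = None    # (num_cached_tokens, d_head) once populated
        self.values = None

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # k, v: (new_tokens, d_head) for the text segment just processed.
        self.keys = k if self.keys is None else torch.cat([self.keys, k], dim=0)
        self.values = v if self.values is None else torch.cat([self.values, v], dim=0)
        # Evict everything outside the window; pages older than the window
        # are unrecoverable at inference time.
        self.keys = self.keys[-self.window:]
        self.values = self.values[-self.window:]

    def context(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Attention over the cached context sees only this local window.
        return self.keys, self.values
```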

Design B: A dual-memory “compressive” design with a fixed-size high-fidelity local memory (recent tokens) plus a fixed-size compressed long-term memory; the model processes the document in sequential segments and updates memory recurrently as each new segment arrives, evicting older local content into the compressed memory.

Assume compression is lossy but space-efficient, and both memories are used together as the attention context at inference time.
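For comparison, here is a minimal sketch of Design B under the same assumptions. Mean-pooling blocks of evicted keys/values stands in for whatever learned, lossy compression the model actually uses, and the names (CompressiveDualMemory, rate, etc.) are illustrative.

```python
import torch

class CompressiveDualMemory:
    """Design B (sketch): fixed-size exact local memory plus fixed-size lossy
    long-term memory; evicted local tokens are compressed, not discarded."""

    def __init__(self, local_size: int, compressed_size: int, rate: int = 4):
        self.local_size = local_size            # tokens kept at full fidelity
        self.compressed_size = compressed_size  # long-term memory slots
        self.rate = rate                        # tokens folded into one slot
        self.local_k = self.local_v = None
        self.comp_k = self.comp_v = None

    @staticmethod
    def _cat(old, new):
        return new if old is None else torch.cat([old, new], dim=0)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # k, v: (new_tokens, d_head) for the segment just processed.
        self.local_k = self._cat(self.local_k, k)
        self.local_v = self._cat(self.local_v, v)
        overflow = self.local_k.shape[0] - self.local_size
        evict = (overflow // self.rate) * self.rate  # evict whole blocks only
        if evict > 0:
            old_k, self.local_k = self.local_k[:evict], self.local_k[evict:]
            old_v, self.local_v = self.local_v[:evict], self.local_v[evict:]
            # Lossy compression: mean-pool each block of `rate` tokens into
            # one long-term slot, then cap long-term memory at its budget.
            d = old_k.shape[-1]
            ck = old_k.reshape(-1, self.rate, d).mean(dim=1)
            cv = old_v.reshape(-1, self.rate, d).mean(dim=1)
            self.comp_k = self._cat(self.comp_k, ck)[-self.compressed_size:]
            self.comp_v = self._cat(self.comp_v, cv)[-self.compressed_size:]

    def context(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Attention sees compressed long-range memory plus exact recent tokens.
        ks = [m for m in (self.comp_k, self.local_k) if m is not None]
        vs = [m for m in (self.comp_v, self.local_v) if m is not None]
        return torch.cat(ks, dim=0), torch.cat(vs, dim=0)
```

In both sketches, context() is what the model attends over at inference time: in Design A, pages older than the window simply vanish, while in Design B they survive only in pooled, lossy form. That difference is the trade-off the question below asks you to weigh.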

Which design (A or B) would you recommend for this product, and why? In your answer, explicitly explain how the chosen memory model functions as a context encoder for both near-term exactness and long-range recall, and identify one concrete failure mode/trade-off your choice introduces (e.g., what kind of question the assistant may answer worse, and why).
