Case Study

Postmortem: Long-Document QA Failures Under Fixed-Window vs Compressive Memory

You are the on-call ML engineer for an internal LLM that answers questions over very long engineering incident reports (50–200 pages). The model must run on a single GPU with a strict, constant upper bound on inference-time memory. Two prototype memory designs are being compared:

Design A (Fixed-Window Local Attention): attention retains key/value pairs only for the most recent 512 tokens (a sliding window); anything older is discarded and is never available to attention again. A minimal sketch of this eviction policy follows.
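A minimal sketch of the fixed-window cache in Python (the class name SlidingWindowKVCache and the window parameter are illustrative, not from the prototype):

```python
from collections import deque


class SlidingWindowKVCache:
    """Fixed-window local attention cache: keeps only the newest
    `window` key/value pairs; anything older is dropped outright."""

    def __init__(self, window: int = 512):
        self.window = window
        # deque(maxlen=...) evicts the oldest entry implicitly on append.
        self.keys: deque = deque(maxlen=window)
        self.values: deque = deque(maxlen=window)

    def append(self, k, v) -> None:
        # Appending past `window` silently discards the oldest pair,
        # so attention can never recover evicted context.
        self.keys.append(k)
        self.values.append(v)

    def attention_context(self):
        # Attention only ever sees the last `window` positions.
        return list(self.keys), list(self.values)
```

The memory bound is exact and trivial to enforce, which is why Design A is attractive under the single-GPU constraint: eviction is the entire mechanism.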

Design B (Compressive Transformer-style Dual Memory): the model keeps (1) a fixed-size local memory holding the most recent 512 tokens at full fidelity and (2) a fixed-size compressed memory storing a lossy summary of older key/value pairs. The model processes the document in sequential segments; when a new segment arrives, the local memory is updated first-in, first-out (FIFO), and the evicted portion is compressed and appended to the compressed memory (overwriting its oldest entries as needed to keep it fixed-size). Attention is computed over the concatenation of local and compressed memory. A sketch of this update rule follows.
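A minimal sketch of the dual-memory update, assuming mean-pooling as the compression function (the Compressive Transformer work also considers convolutional and learned compressors); all names and sizes here (CompressiveDualMemory, rate, compressed_size) are illustrative:

```python
import torch


class CompressiveDualMemory:
    """Sketch of Design B: full-fidelity local memory plus a lossy,
    fixed-size compressed memory built from evicted key/value pairs."""

    def __init__(self, d_model: int = 64, local_size: int = 512,
                 compressed_size: int = 512, rate: int = 4):
        self.local_size = local_size
        self.compressed_size = compressed_size
        self.rate = rate  # compression rate: `rate` old KVs -> 1 summary KV
        self.local_k = torch.empty(0, d_model)
        self.local_v = torch.empty(0, d_model)
        self.comp_k = torch.empty(0, d_model)
        self.comp_v = torch.empty(0, d_model)

    def _compress(self, x: torch.Tensor) -> torch.Tensor:
        # Lossy summary: mean-pool each block of `rate` vectors.
        # Remainder rows short of a full block are dropped in this sketch.
        n = (x.shape[0] // self.rate) * self.rate
        return x[:n].reshape(-1, self.rate, x.shape[1]).mean(dim=1)

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor) -> None:
        # 1) FIFO update of the full-fidelity local memory.
        self.local_k = torch.cat([self.local_k, new_k])
        self.local_v = torch.cat([self.local_v, new_v])
        overflow = self.local_k.shape[0] - self.local_size
        if overflow > 0:
            evicted_k, self.local_k = self.local_k[:overflow], self.local_k[overflow:]
            evicted_v, self.local_v = self.local_v[:overflow], self.local_v[overflow:]
            # 2) Compress evicted KVs and append to the compressed memory.
            self.comp_k = torch.cat([self.comp_k, self._compress(evicted_k)])
            self.comp_v = torch.cat([self.comp_v, self._compress(evicted_v)])
            # 3) Keep the compressed memory fixed-size by dropping its oldest rows.
            self.comp_k = self.comp_k[-self.compressed_size:]
            self.comp_v = self.comp_v[-self.compressed_size:]

    def attention_context(self):
        # Attention runs over compressed (older) + local (recent) memory.
        return (torch.cat([self.comp_k, self.local_k]),
                torch.cat([self.comp_v, self.local_v]))
```

The key design choice is that old context degrades gracefully rather than vanishing: each evicted block survives as one pooled vector, so gist is retained while token-level detail (an exact error code, for instance) may be averaged away.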

During evaluation, both designs handle questions about the last few pages well. However, for questions like “What was the first mitigation attempted and why was it rolled back later?”, Design A often answers confidently but incorrectly, while Design B is usually correct but sometimes misses exact wording (e.g., the precise error code) from early pages.

As the person recommending which design to ship, write a brief decision memo (6–10 sentences) that: (1) explains the observed behavior of both designs in terms of what their memory is encoding as context, (2) identifies the key trade-off between constant-memory local windows and segment-based recurrent updates with compression, and (3) recommends one design for this use case, including one concrete mitigation for its main weakness.
