Postmortem: Long-Document QA Failures Under Fixed-Window vs Compressive Memory
You are the on-call ML engineer for an internal LLM that answers questions over very long engineering incident reports (50–200 pages). The model must run on a single GPU with a strict, constant upper bound on inference-time memory. Two prototype memory designs are being compared:
Design A (Fixed-Window Local Attention): the attention mechanism only retains key/value pairs for the most recent 512 tokens (a sliding window). Anything older is not available to attention.
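Design A's retention policy amounts to a bounded FIFO cache. A minimal sketch (toy string placeholders stand in for key/value tensors; real implementations store per-layer, per-head tensors, and `SlidingWindowKVCache` is a hypothetical name for illustration):

```python
from collections import deque

class SlidingWindowKVCache:
    """Design A sketch: keep key/value pairs only for the most recent `window` tokens."""

    def __init__(self, window=512):
        # deque(maxlen=...) silently evicts the oldest entry on overflow,
        # which is exactly the sliding-window retention policy
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attendable(self):
        # Attention can only see these entries; anything older is simply gone,
        # so questions about early pages have no supporting context at all.
        return list(self.keys), list(self.values)
```

The constant memory bound comes for free from `maxlen`, but so does the failure mode: evicted tokens leave no trace.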
Design B (Compressive Transformer-style Dual Memory): the model keeps (1) a fixed-size local memory holding the most recent 512 tokens in full fidelity and (2) a fixed-size compressed memory storing a lossy summary of older key/value pairs. The model processes the document in sequential segments; when a new segment arrives, the local memory is updated in FIFO order, and the evicted portion is compressed and appended to the compressed memory (overwriting its oldest entries as needed to keep it fixed-size). Attention is computed over the concatenation of local and compressed memory.
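The dual-memory update loop above can be sketched as follows (a toy model: scalar "KV entries" and mean-pooling as the stand-in compression function; both are illustrative assumptions, not the actual architecture, and `CompressiveMemory` is a hypothetical name):

```python
from collections import deque

class CompressiveMemory:
    """Design B sketch: full-fidelity local FIFO plus a fixed-size lossy compressed store."""

    def __init__(self, local_size=512, comp_size=128, rate=2):
        self.local_size = local_size
        self.local = []                            # recent entries, uncompressed
        self.compressed = deque(maxlen=comp_size)  # lossy summaries; oldest overwritten
        self.rate = rate                           # compression ratio

    def add_segment(self, segment):
        self.local.extend(segment)
        overflow = len(self.local) - self.local_size
        if overflow > 0:
            # FIFO eviction from local memory...
            evicted, self.local = self.local[:overflow], self.local[overflow:]
            # ...then lossy compression of the evicted portion:
            # mean-pool groups of `rate` entries into the compressed store
            for i in range(0, len(evicted), self.rate):
                group = evicted[i:i + self.rate]
                self.compressed.append(sum(group) / len(group))

    def context(self):
        # attention runs over the concatenation of compressed + local memory
        return list(self.compressed) + self.local
```

Note that old content survives eviction, but only as averaged (lossy) summaries, which mirrors Design B's observed behavior: gist preserved, exact wording degraded.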
During evaluation, both designs handle questions about the last few pages well. However, for questions like “What was the first mitigation attempted and why was it rolled back later?”, Design A often answers confidently but incorrectly, while Design B is usually correct but sometimes misses exact wording (e.g., the precise error code) from early pages.
As the person recommending which design to ship, write a brief decision memo (6–10 sentences) that: (1) explains the observed behavior of both designs in terms of what their memory is encoding as context, (2) identifies the key trade-off between constant-memory local windows and segment-based recurrent updates with compression, and (3) recommends one design for this use case, including one concrete mitigation for its main weakness.

Tags: Ch.2 Generative Models, Ch.4 Alignment, Foundations of Large Language Models Course, Computing Sciences