1Cademy - Post-Incident Review: Memory Design for Long-Running Customer Support Chats

You are leading a post-incident review for an LLM-powered customer support assistant that handles chat sessions lasting 2–6 hours. To keep latency and GPU memory stable, the current system attends over a fixed-size sliding-window KV cache that holds only the most recent 512 tokens. In a recent incident, the assistant repeatedly contradicted a critical customer constraint ("do not disclose pricing to third parties") that was stated near the beginning of the chat. At the time of each contradiction, the last 512 tokens contained no mention of the constraint.
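The failure mode can be reproduced in a few lines. The sketch below is an illustration only, not the production system: it models the cache as a plain token buffer rather than per-layer keys and values, and the token strings and the `visible_context` helper are hypothetical names chosen for this example.

```python
from collections import deque

WINDOW = 512  # local window size stated in the scenario


def visible_context(tokens, window=WINDOW):
    """Return the tokens a sliding-window cache still holds after reading `tokens`."""
    cache = deque(maxlen=window)  # the oldest entry is dropped on overflow
    for tok in tokens:
        cache.append(tok)
    return list(cache)


# Hypothetical chat: one constraint token at the start, then 2,000 later tokens.
CONSTRAINT = "<constraint:no-third-party-pricing>"
chat = [CONSTRAINT] + [f"tok{i}" for i in range(2000)]

context = visible_context(chat)
constraint_visible = CONSTRAINT in context

print(len(context), constraint_visible)  # 512 False
```

Once the constraint has left the window, nothing in the model's attention input can represent it, so no prediction can be conditioned on it.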

You are asked to propose a revised memory approach that still keeps attention-time memory bounded, but reduces the risk of losing important early constraints. Write an evaluation that:

1. Explains, using the idea of a memory model as a context encoder, why the sliding-window design failed in this incident (be explicit about what information is and is not representable at prediction time).
2. Proposes a concrete architecture that combines (a) a fixed-size local memory for recent tokens and (b) a fixed-size compressed long-term memory, and describes how the two are combined for attention at inference.
3. Describes how the memory is updated recurrently using segments over the course of the chat (what happens when a new segment arrives, what gets evicted from local memory, and how it becomes part of the compressed memory).
4. Critically discusses at least two tradeoffs/risks introduced by compression and segment-based updates (e.g., what kinds of errors or information loss might occur, and how that compares to the original sliding-window approach).

Assume you cannot increase the 512-token local window, and you cannot store the full uncompressed history.
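As a starting point for the evaluation, one possible shape of such a design is sketched below. Everything in it is an assumption made for illustration: the `TwoTierMemory` class, mean pooling as the compression function, the policy of merging the two oldest summaries when the compressed memory is full, the use of a single vector as both key and value, and the toy sizes (4 local slots instead of 512). A real system would apply the same idea to per-layer keys and values and would likely use a learned compressor.

```python
import math


def merge(a, b):
    """Weighted mean of two (vector, token_count) summaries."""
    (va, na), (vb, nb) = a, b
    n = na + nb
    return ([(x * na + y * nb) / n for x, y in zip(va, vb)], n)


class TwoTierMemory:
    """Fixed-size local memory plus fixed-size compressed memory (illustrative)."""

    def __init__(self, local_size, comp_slots, rate):
        self.local_size = local_size  # max uncompressed recent vectors
        self.comp_slots = comp_slots  # max compressed summaries
        self.rate = rate              # evicted vectors pooled into one summary
        self.local = []               # recent vectors, newest last
        self.compressed = []          # (vector, token_count) pairs, oldest first

    def update(self, segment):
        """Append a new segment and compress whatever falls out of local memory."""
        self.local.extend(segment)
        overflow = len(self.local) - self.local_size
        if overflow <= 0:
            return
        evicted, self.local = self.local[:overflow], self.local[overflow:]
        for i in range(0, len(evicted), self.rate):
            group = evicted[i:i + self.rate]
            # Mean pooling: one summary stands in for up to `rate` evicted vectors.
            pooled = [sum(col) / len(group) for col in zip(*group)]
            self.compressed.append((pooled, len(group)))
        # Keep long-term memory bounded by merging the two oldest summaries.
        while len(self.compressed) > self.comp_slots:
            self.compressed[:2] = [merge(self.compressed[0], self.compressed[1])]

    def attend(self, query):
        """Softmax attention over compressed summaries followed by local vectors."""
        memory = [vec for vec, _ in self.compressed] + self.local
        if not memory:
            raise ValueError("memory is empty")
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, vec)) / scale for vec in memory]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        return [sum(w * vec[d] for w, vec in zip(weights, memory)) / total
                for d in range(len(query))]


# Toy run: dimension 0 marks the constraint, dimension 1 marks everything else.
CONSTRAINT, FILLER = [1.0, 0.0], [0.0, 1.0]
mem = TwoTierMemory(local_size=4, comp_slots=2, rate=2)
mem.update([CONSTRAINT, FILLER, FILLER, FILLER])  # segment 1: fits in local memory
mem.update([FILLER] * 4)                          # segment 2: segment 1 is compressed
mem.update([FILLER] * 4)                          # segment 3: old summaries are merged

print(len(mem.compressed), len(mem.local))         # 2 4
oldest_vector, tokens_covered = mem.compressed[0]
print(round(oldest_vector[0], 3), tokens_covered)  # 0.167 6
```

In this toy run both memories stay at their fixed sizes, and the constraint is still represented after it has left the local window, but only as one sixth of a summary covering six tokens. That dilution, and the fact that it grows with every merge, is one of the tradeoffs the evaluation should weigh against the sliding window's outright loss.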

Updated 2026-02-06
