Case Study

Incident Triage: Long-Running Agent Workflow with Windowed vs Compressive Memory

You are the on-call ML engineer for an internal LLM agent that executes multi-hour IT change workflows. The agent reads a stream of tickets, runbook steps, and tool outputs. A recent incident: the agent correctly followed steps for ~90 minutes, then executed a rollback command that was explicitly forbidden in the initial change-approval section near the start of the session. The model uses attention over a memory component that encodes context for next-token prediction.

Two candidate memory designs are being debated:

(A) Fixed-size sliding-window memory: keep KV pairs for only the most recent 1,024 tokens (local attention).

(B) Compressive Transformer-style dual memory: keep a fixed-size local memory of recent, uncompressed KV pairs; when old KV pairs are evicted (FIFO), compress them and store them in a separate fixed-size compressed memory, and compute attention over the concatenation of compressed and local memory.

In both designs, the system processes the stream in segments and updates memory recurrently as each segment arrives.
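To make the mechanics of design (B) concrete, here is a minimal sketch of the dual-memory bookkeeping. It is illustrative only: scalar values stand in for KV pairs, compression is mean-pooling at a fixed rate `c`, and all names (`CompressiveMemory`, `append_segment`, `attention_context`) are hypothetical, not from any particular implementation.

```python
from collections import deque

class CompressiveMemory:
    """Sketch of a Compressive Transformer-style dual memory.

    Assumptions: scalars stand in for KV pairs, and compression is
    mean-pooling groups of c evicted entries into one compressed slot.
    """

    def __init__(self, local_size=4, compressed_size=4, c=2):
        self.local = deque()  # recent, uncompressed entries
        # Compressed memory is itself bounded FIFO: once full, the
        # oldest compressed slot is discarded for good.
        self.compressed = deque(maxlen=compressed_size)
        self.local_size = local_size
        self.c = c  # compression rate: c local entries -> 1 slot

    def append_segment(self, segment):
        """Recurrent update: absorb one segment of the stream."""
        self.local.extend(segment)
        # FIFO eviction: compress the oldest local entries in groups
        # of c instead of dropping them outright.
        while len(self.local) > self.local_size:
            group = [self.local.popleft() for _ in range(self.c)]
            self.compressed.append(sum(group) / len(group))

    def attention_context(self):
        # Attention is computed over compressed + local, oldest first.
        return list(self.compressed) + list(self.local)
```

For contrast, a pure sliding window (design A) is just the `local` deque with `maxlen` set: evicted entries vanish entirely, so an early instruction older than the window is unreachable at the decision point, whereas here a lossy trace of it may survive in compressed memory.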

Assume the forbidden rollback instruction is rarely repeated later, but it is critical when deciding actions near the end of the workflow.

As the incident owner, which design (A or B) would you recommend to reduce the chance of repeating this specific failure while keeping memory usage bounded, and why? Your answer must explain (i) how the chosen memory acts as a context encoder for the decision point, (ii) how segment-based recurrent updates and FIFO eviction affect what information remains accessible, and (iii) the key trade-off introduced by compression versus a pure sliding window for this scenario.

Updated 2026-02-06
