1Cademy - Designing a Decomposition Workflow for Root-Cause Analysis of a Production Incident

Case Study

You are building an internal LLM-powered assistant that helps Site Reliability Engineers (SREs) produce a root-cause analysis (RCA) draft within 30 minutes of a Sev-2 incident. The assistant receives: (1) a short incident timeline written by the on-call engineer, (2) links to six plain-text log excerpts, and (3) a runbook. The assistant must output an RCA draft containing a suspected root cause, contributing factors, customer impact, and three concrete follow-up actions.
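One way to make the inputs and the required output concrete is to pin them down as data types before designing any prompts. The sketch below is only an illustration: the class and field names (`IncidentInput`, `RCADraft`, `validate`) are hypothetical and not part of the case study.

```python
from dataclasses import dataclass


@dataclass
class IncidentInput:
    timeline: str          # on-call engineer's timeline; may be wrong or incomplete
    log_links: list[str]   # links to the plain-text log excerpts
    runbook: str


@dataclass
class RCADraft:
    suspected_root_cause: str
    contributing_factors: list[str]
    customer_impact: str
    follow_up_actions: list[str]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the draft is complete."""
        problems = []
        if not self.suspected_root_cause.strip():
            problems.append("missing suspected root cause")
        if not self.contributing_factors:
            problems.append("missing contributing factors")
        if not self.customer_impact.strip():
            problems.append("missing customer impact")
        if len(self.follow_up_actions) != 3:
            problems.append("expected exactly 3 follow-up actions")
        return problems
```

A check like `validate()` could run on the model's final output so that an incomplete draft triggers a retry instead of reaching the SRE.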

Constraints: The LLM context window is limited, so you cannot paste all logs at once. The incident timeline is sometimes wrong or incomplete. Some sub-questions (e.g., “What changed in the last deploy?”) often require further breakdown (e.g., identify deploy ID → list changed services → inspect config diffs). You also need traceability: the RCA draft must cite which intermediate findings it relied on.
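The context-window and traceability constraints interact: whatever is carried forward has to fit a budget and still be citable. A minimal sketch of one possible policy follows, assuming each intermediate finding gets a stable ID and a short summary. The `Finding` type, the 4-characters-per-token estimate, and the `keep_full` parameter are all illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    fid: str       # stable ID such as "F1", cited later in the RCA draft
    question: str
    answer: str
    summary: str   # short form used once the finding is no longer recent


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_context(findings: list[Finding], budget_tokens: int, keep_full: int = 2) -> str:
    """Render prior findings newest-first until the budget is spent.

    The most recent `keep_full` findings are included as full Q/A pairs,
    older ones as summaries. Every line keeps its ID so it can be cited.
    """
    lines: list[str] = []
    used = 0
    n = len(findings)
    for idx in range(n - 1, -1, -1):
        f = findings[idx]
        if idx >= n - keep_full:
            text = f"[{f.fid}] Q: {f.question} A: {f.answer}"
        else:
            text = f"[{f.fid}] {f.summary}"
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            break  # older findings are dropped once the budget is exhausted
        lines.append(text)
        used += cost
    lines.reverse()  # restore chronological order for the prompt
    return "\n".join(lines)
```

Because dropped findings never reach the prompt, the final draft can only cite IDs that were actually in context, which keeps the citation list honest.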

Case study task: Propose a decomposition-based workflow that (a) generates an initial set of sub-problems, (b) solves them sequentially while carrying forward prior Q/A pairs as context, and (c) supports recursive decomposition when a sub-problem is too complex to answer directly. Then justify one key design tradeoff you make between (i) generating all sub-problems up front versus generating them dynamically as new evidence appears, and (ii) including all prior Q/A pairs versus selectively summarizing them to fit the context window. Your answer should be specific enough that an engineer could implement the prompting and orchestration logic.
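As a starting point for the orchestration logic, the three requirements (a) to (c) can be sketched as a small recursive loop. This is one possible shape, not a reference answer: `decompose` and `answer` stand in for LLM calls and are hypothetical callables, `answer` is assumed to return `None` when a question is too complex to answer directly, and `max_depth` is an assumed safeguard against unbounded recursion.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Finding:
    fid: str
    question: str
    answer: str
    cites: list[str] = field(default_factory=list)  # IDs that were in context


# Hypothetical signatures for the two LLM-backed steps.
Decompose = Callable[[str, list[Finding]], list[str]]
Answer = Callable[[str, list[Finding]], Optional[str]]


def solve(question: str, decompose: Decompose, answer: Answer,
          findings: list[Finding], depth: int = 0, max_depth: int = 3) -> Finding:
    """Answer directly if possible; otherwise split, solve children, retry."""
    result = answer(question, list(findings))
    if result is None and depth < max_depth:
        # (c) recursive decomposition: children are solved in order and
        # appended to `findings`, so each one sees its predecessors.
        for sub in decompose(question, list(findings)):
            solve(sub, decompose, answer, findings, depth + 1, max_depth)
        result = answer(question, list(findings))
    if result is None:
        result = "UNRESOLVED"  # surfaced to the SRE rather than guessed
    finding = Finding(f"F{len(findings) + 1}", question, result,
                      cites=[f.fid for f in findings])
    findings.append(finding)
    return finding


def run_workflow(initial_questions: list[str], decompose: Decompose,
                 answer: Answer) -> list[Finding]:
    # (a) the initial sub-problems are supplied up front;
    # (b) they are solved sequentially, carrying prior findings forward.
    findings: list[Finding] = []
    for q in initial_questions:
        solve(q, decompose, answer, findings)
    return findings
```

Here all prior findings are passed as context and the initial questions are fixed in advance; swapping in dynamic question generation or a summarizing context builder would change only the arguments passed to `answer` and `decompose`, which is where the two tradeoffs in the task come into play.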

Updated 2026-02-06
