1Cademy - Debugging a Multi-Step LLM Workflow for Contract Clause Risk Triage

Case Study

You are rolling out an internal LLM assistant for Legal Ops to triage vendor contract clauses into three labels: (A) "standard/low risk", (B) "needs legal review", (C) "reject". The assistant must also produce a short justification that can be audited. In a pilot, you observe two recurring failures: (1) on complex clauses, the model gives confident but wrong labels unless it is guided through intermediate reasoning; (2) when you add a simple instruction like "Let’s think step by step," the model often produces a long rationale but sometimes forgets to output a clear final label.
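Failure (2) can be detected mechanically if the model is asked to end its answer with a fixed label line. The sketch below assumes a convention of `FINAL LABEL: A|B|C` on its own line; that wording and the function name `extract_final_label` are illustrative choices, not part of the case.

```python
import re
from typing import Optional

# Assumed output convention for this sketch: the model is instructed to end
# with a line such as "FINAL LABEL: B".
FINAL_LABEL_RE = re.compile(
    r"^\s*FINAL LABEL:\s*([ABC])\s*$", re.IGNORECASE | re.MULTILINE
)


def extract_final_label(response: str) -> Optional[str]:
    """Return 'A', 'B', or 'C' if the response has a final-label line, else None."""
    matches = FINAL_LABEL_RE.findall(response)
    # Take the last match so a label line quoted mid-rationale does not win.
    return matches[-1].upper() if matches else None
```

A `None` result flags a rationale that never committed to a label, which is the signal a later call (or a fallback rule) can act on.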

You are not allowed to fine-tune the model. You can only change the prompting workflow and the content placed in the prompt. You have a strict context window budget, so you can include at most ONE worked example in the prompt, and you must keep the number of model calls per clause to no more than 3.

Case: A new clause says: "Vendor may use Customer Data to improve its services and may share Customer Data with affiliates and subcontractors for that purpose. Customer may opt out of data sharing by emailing support within 10 days of signing." Your policy requires: (i) sharing of Customer Data with third parties for model training or service improvement must be explicitly prohibited or strongly constrained; (ii) opt-out mechanisms are considered insufficient for sensitive data; (iii) subcontractor access must be tightly limited.
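One way to prepare this case for decomposition is to turn each policy rule into a narrow yes/no sub-question, ordered so that simpler factual checks come first. The sketch below is only an illustration: the question wording, the `SubQuestion` type, and `build_decomposition_prompt` are assumptions, not a prescribed answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubQuestion:
    rule: str      # policy rule id as given in the case text
    question: str  # yes/no question the model answers about the clause


# Hypothetical phrasing of the three policy rules as sub-questions.
SUB_QUESTIONS = [
    SubQuestion("i", "Does the clause allow Customer Data to be shared with "
                     "third parties for service improvement or model training?"),
    SubQuestion("ii", "Is an opt-out the only control offered to the customer?"),
    SubQuestion("iii", "Is subcontractor access to Customer Data left without "
                       "tight limits?"),
]


def build_decomposition_prompt(clause: str) -> str:
    """Assemble a prompt that asks the sub-questions one at a time."""
    lines = [
        "Answer each question with YES or NO and one sentence of evidence.",
        "",
        f"Clause: {clause}",
        "",
    ]
    for n, sq in enumerate(SUB_QUESTIONS, start=1):
        lines.append(f"Q{n} (policy {sq.rule}): {sq.question}")
    return "\n".join(lines)
```

Keeping the rule id next to each question makes the eventual justification easier to audit, since every answer traces back to a named policy rule.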

Design a prompting workflow (describe the sequence of prompts/calls and what each prompt contains) that uses in-context learning, problem decomposition/least-to-most prompting, chain-of-thought (including zero-shot CoT where appropriate), and self-refinement to reduce both failure modes while staying within the constraints. Your answer must explain how the techniques interact (e.g., why you place a worked example where you do, how you decompose the decision, how you ensure a final label is always produced, and how critique/revision is used without exceeding 3 calls).
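As a starting point for discussion, the sketch below shows one possible shape for a three-call workflow: decomposition with zero-shot CoT, a draft that uses the single worked example, and a critique-and-revise pass. It is a minimal sketch under stated assumptions, not a reference answer: `llm` is a hypothetical callable that maps a prompt string to model text, `WORKED_EXAMPLE` is a placeholder, the `FINAL LABEL:` line is an assumed output convention, and defaulting to "B" when no label can be parsed is an assumed conservative fallback.

```python
import re
from typing import Callable, Optional

LLM = Callable[[str], str]  # hypothetical: prompt in, model text out
MAX_CALLS = 3               # per-clause budget from the case constraints
LABEL_RE = re.compile(r"^\s*FINAL LABEL:\s*([ABC])\s*$", re.IGNORECASE | re.MULTILINE)

# Placeholder for the one worked example allowed by the context budget.
WORKED_EXAMPLE = "<worked example: clause, findings, justification, FINAL LABEL line>"


def parse_label(text: str) -> Optional[str]:
    found = LABEL_RE.findall(text)
    return found[-1].upper() if found else None


def triage(clause: str, policy: str, llm: LLM) -> dict:
    calls = 0

    def ask(prompt: str) -> str:
        nonlocal calls
        if calls >= MAX_CALLS:
            raise RuntimeError("call budget exceeded")
        calls += 1
        return llm(prompt)

    context = f"Policy:\n{policy}\n\nClause:\n{clause}\n\n"

    # Call 1: decomposition plus zero-shot CoT. No worked example here, so the
    # context budget is spent on the step that must follow a strict format.
    findings = ask(context + "List what the clause permits, one policy rule "
                             "at a time. Let's think step by step.")

    # Call 2: the single worked example demonstrates the output format,
    # including the mandatory final-label line.
    draft = ask(f"{WORKED_EXAMPLE}\n\n{context}Findings:\n{findings}\n\n"
                "Write a short justification, then end with one line: "
                "'FINAL LABEL: A', 'FINAL LABEL: B', or 'FINAL LABEL: C'.")

    # Call 3: self-refinement. Critique and revision share one call to stay
    # inside the budget.
    final = ask(f"{context}Draft answer:\n{draft}\n\n"
                "Critique the draft against each policy rule, correct any "
                "errors, then output the revised justification and end with "
                "one line 'FINAL LABEL: X' where X is A, B, or C.")

    # Guarantee a label: prefer the revision, then the draft, then fall back
    # to "B" (needs legal review) as an assumed safe default.
    label = parse_label(final) or parse_label(draft) or "B"
    return {"label": label, "justification": final, "calls": calls}
```

Placing the worked example in the second call rather than the first is one defensible choice: the first call only gathers findings, while the second is where a demonstration of the label line matters most. Other placements can be argued for, and the trade-off is part of what the case asks you to explain.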

Updated 2026-02-06
