Case Study

Designing a Prompt-Optimization-and-Ensembling Strategy for a Multi-Model Enterprise Rollout

You lead an applied AI team rolling out an LLM-based “contract clause risk flagger” used by Legal Ops. The system must work across two vendor LLMs (Model A and Model B) because different business units are locked into different providers. You have a labeled evaluation set of 2,000 clauses, but only 200 can be sent to human review per week for high-quality scoring. In offline tests, single prompts show high variance: a prompt that is best on Model A is often mediocre on Model B, and small wording changes can flip outcomes. You also have a strict runtime budget: at most 2 LLM calls per clause in production.

Propose a concrete automated prompt design approach that (1) frames prompt optimization as a search problem (define the search space, search strategy, and performance estimation), (2) uses an iterative LLM-based refinement loop (evaluation–pruning–expansion) to discover candidates, (3) incorporates an evolutionary computation element (e.g., mutation/crossover) to maintain useful diversity, and (4) ends with a prompt ensembling plan that fits the 2-call runtime limit while improving cross-model robustness.
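To make parts (1)–(3) concrete, here is a minimal Python sketch (not a model answer) of one way the evaluation–pruning–expansion loop and the evolutionary operators could compose. Everything in it is an assumption for illustration: `score_prompt`, `llm_rewrite`, and `crossover` are hypothetical stubs standing in for real vendor-SDK calls and your dev-set plumbing, the worst-of-two-models scoring rule is just one portability heuristic, and the hyperparameters are placeholders. It also shows one example stopping condition (patience on the best worst-case score).

```python
import random

# Hypothetical stubs, for illustration only. A real implementation would call
# the two vendor SDKs and score against the slice of the labeled set covered
# by the weekly 200-clause human-review budget.
def score_prompt(prompt: str, model: str, clauses) -> float:
    """Accuracy of `prompt` on `clauses` when run through `model` (stubbed)."""
    rng = random.Random(hash((prompt, model)))  # deterministic placeholder score
    return rng.uniform(0.5, 0.9)

def llm_rewrite(prompt: str) -> str:
    """Mutation operator: ask an LLM to rephrase one instruction (stubbed)."""
    return prompt + " Be precise."

def crossover(a: str, b: str) -> str:
    """Crossover operator: splice the halves of two surviving prompts."""
    return a[: len(a) // 2] + b[len(b) // 2 :]

def search(seeds, clauses, max_rounds=10, beam=4, n_children=4, patience=3):
    """Beam-style search over a prompt population (assumes >= 2 seed prompts)."""
    population, best, stale = list(seeds), 0.0, 0
    for _ in range(max_rounds):
        # Evaluation: score every candidate on BOTH models and keep the worse
        # of the two scores, so the search rewards cross-model robustness
        # instead of overfitting to one vendor's quirks.
        scored = sorted(
            ((min(score_prompt(p, "model_a", clauses),
                  score_prompt(p, "model_b", clauses)), p)
             for p in population),
            reverse=True,
        )
        # Stopping condition: halt once the best worst-case score has not
        # improved for `patience` consecutive rounds.
        if scored[0][0] <= best:
            stale += 1
            if stale >= patience:
                break
        else:
            best, stale = scored[0][0], 0
        # Pruning: keep only the top-`beam` survivors.
        survivors = [p for _, p in scored[:beam]]
        # Expansion: LLM mutations plus one crossover child maintain diversity.
        population = survivors + [llm_rewrite(random.choice(survivors))
                                  for _ in range(n_children)]
        population.append(crossover(*random.sample(survivors, 2)))
    return [p for _, p in scored[:beam]]
```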

In your answer, explicitly justify at least two tradeoffs you are making (e.g., exploration vs. cost, diversity vs. convergence, offline score vs. cross-model portability) and specify a stopping condition for the search.
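For part (4), one shape the ensemble can take under the 2-call budget is sketched below: each clause goes once to each vendor model, each with its own surviving prompt, and the two flags are combined by union. `call_model` is a hypothetical stand-in for the vendor SDKs, and union voting is only one aggregation rule; agreement-gated escalation or confidence-weighted voting are reasonable alternatives depending on the precision/recall target.

```python
def call_model(model: str, prompt: str, clause: str) -> str:
    """Hypothetical wrapper over the two vendor SDKs; returns "RISK" or "OK"."""
    return "OK"  # stub

def flag_clause(clause: str, prompt_a: str, prompt_b: str) -> bool:
    """Two-call ensemble: one call per vendor model, each with its own prompt.

    Union voting (flag when EITHER model flags) trades precision for recall,
    which suits a risk screen whose positives are routed to human review.
    """
    risky_a = call_model("model_a", prompt_a, clause) == "RISK"
    risky_b = call_model("model_b", prompt_b, clause) == "RISK"
    return risky_a or risky_b
```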

Updated 2026-02-06

Tags

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Data Science