Case Study

Debugging a Stagnating Prompt Optimizer and Designing a More Reliable Deployment

You are responsible for an internal LLM feature that extracts three fields from vendor contracts ("termination notice period", "auto-renewal", and "governing law") and returns a JSON object. The business constraint is that the system must keep token usage low and must be robust to weekly model version updates. Your team built an automated prompt optimization pipeline that treats prompts as candidates in a search process: it starts with 30 seed prompts, evaluates each on a 200-document validation set, keeps the top 5, and then uses the LLM to generate 25 new prompts by "improving" those top 5. After 8 cycles, the best score has plateaued and the top prompt is brittle: it performs well on the validation set but fails on a new batch of contracts with different formatting. You are considering adding (a) prompt ensembling at inference time and (b) an evolutionary computation step (selection + crossover + mutation) to generate candidates instead of only LLM-written rewrites.
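For reference, here is a minimal sketch of the current pipeline as described above. The helper names (`evaluate_prompt`, which returns a score on the 200-document validation set, and `llm_rewrite`, which asks the LLM to "improve" one parent prompt) are hypothetical placeholders, not part of the actual implementation.

```python
# A minimal sketch of the existing search loop, under the assumptions stated above.
from typing import Callable

def prompt_search(
    seed_prompts: list[str],
    evaluate_prompt: Callable[[str], float],  # hypothetical: mean field accuracy on the 200-doc validation set
    llm_rewrite: Callable[[str], str],        # hypothetical: LLM call returning an "improved" variant of one prompt
    n_keep: int = 5,
    n_new: int = 25,
    n_cycles: int = 8,
) -> str:
    candidates = list(seed_prompts)  # 30 seed prompts
    for _ in range(n_cycles):
        # Performance estimation: every candidate is scored on the same fixed validation set.
        scores = {p: evaluate_prompt(p) for p in candidates}
        survivors = sorted(candidates, key=scores.get, reverse=True)[:n_keep]  # keep the top 5
        # Expansion: every new candidate is an LLM rewrite of a survivor, so the pool
        # tends to drift toward paraphrases of the current leaders.
        children = [llm_rewrite(survivors[i % n_keep]) for i in range(n_new)]
        candidates = survivors + children
    return max(candidates, key=evaluate_prompt)
```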

As the lead, propose a revised end-to-end approach that (1) explains why the current iterative LLM-based prompt search is likely stagnating and overfitting, (2) specifies how you would change the search space, search strategy, and performance estimation to reduce brittleness, and (3) justifies where and how you would use prompt ensembling versus evolutionary operators to balance reliability gains against token-cost constraints. Your answer should be concrete enough that an engineer could implement the next experiment (e.g., what gets evaluated, what gets pruned, what gets expanded, and how outputs are aggregated).
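To make the two options under consideration concrete, the sketches below illustrate (a) prompt ensembling at inference time and (b) the evolutionary operators named above. Both are illustrative assumptions, not prescribed implementations: `call_llm`, `llm_mutate`, the snake_case field names, and the sentence-level prompt representation are all hypothetical.

```python
# (a) Field-level prompt ensembling at inference time: run k prompts on the same
# document, parse each JSON output, and majority-vote per field. Assumes a
# hypothetical `call_llm(prompt, document)` that returns a JSON string.
import json
from collections import Counter

FIELDS = ["termination_notice_period", "auto_renewal", "governing_law"]  # assumed key names

def ensemble_extract(document: str, prompts: list[str], call_llm) -> dict:
    """Aggregate the outputs of several prompts by per-field majority vote."""
    votes = {f: Counter() for f in FIELDS}
    for p in prompts:
        try:
            parsed = json.loads(call_llm(p, document))
        except json.JSONDecodeError:
            continue  # a malformed output simply loses its vote
        for f in FIELDS:
            if parsed.get(f) is not None:
                votes[f][str(parsed[f])] += 1
    # Majority vote per field; None if no prompt produced a usable value.
    return {f: (votes[f].most_common(1)[0][0] if votes[f] else None) for f in FIELDS}
```

Note that ensembling with k prompts multiplies per-document inference cost by roughly k, which matters under the token-budget constraint.

```python
# (b) Evolutionary candidate generation (crossover + mutation), assuming each prompt
# is represented as an ordered list of instruction sentences. The single-point
# crossover scheme and the `llm_mutate` paraphrase helper are illustrative assumptions.
import random

def crossover(parent_a: list[str], parent_b: list[str]) -> list[str]:
    """Single-point crossover over the sentence lists of two parent prompts."""
    cut_a = random.randint(1, max(1, len(parent_a) - 1))
    cut_b = random.randint(1, max(1, len(parent_b) - 1))
    return parent_a[:cut_a] + parent_b[cut_b:]

def mutate(prompt_sentences: list[str], llm_mutate, rate: float = 0.2) -> list[str]:
    """Rewrite each sentence with probability `rate` via an LLM paraphrase call."""
    return [llm_mutate(s) if random.random() < rate else s for s in prompt_sentences]
```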

Tags: Ch.3 Prompting - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Data Science
