Case Study

Selecting a Robust Automated Prompt Optimization Approach Under Noisy Evaluation and Latency Constraints

You lead an applied LLM team building an internal tool that classifies employee IT help-desk tickets into one of 25 routing categories. The business requires (1) stable accuracy week-to-week despite small wording changes in tickets, (2) average end-to-end latency under 800 ms per ticket, and (3) a hard budget that limits you to at most 2,000 total model calls per day for both optimization and production. You have a labeled validation set of 5,000 historical tickets, but labels are somewhat noisy (some tickets were misrouted historically). You can run offline experiments nightly, but production must be deterministic and auditable.

Your team proposes two competing designs:

Design A (Single-Prompt Evolution): Treat prompts as a population and use an evolutionary algorithm nightly: evaluate each prompt on a sampled subset of the validation set, select top performers, generate new prompts via crossover (combining phrases) and mutation (small random edits), and repeat for 20 generations. Deploy the single best prompt found.
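Design A's nightly loop can be sketched as below. This is a minimal illustration, not the team's actual implementation: `evaluate`, `crossover`, and `mutate` are hypothetical stand-ins (real evaluation would cost one model call per (ticket, prompt) pair, and real mutation would edit the prompt via an LLM or rule set).

```python
import random

def evaluate(prompt, tickets):
    # Stand-in for classification accuracy on a sampled validation subset.
    # In practice this costs one model call per (ticket, prompt) pair.
    return sum(len(prompt) % 7 == t % 7 for t in tickets) / len(tickets)

def crossover(p1, p2):
    # Combine phrases from two parent prompts (here: first half + second half).
    w1, w2 = p1.split(), p2.split()
    return " ".join(w1[:len(w1) // 2] + w2[len(w2) // 2:])

def mutate(prompt):
    # Small random edit (here: duplicate one word as a toy perturbation).
    words = prompt.split()
    i = random.randrange(len(words))
    return " ".join(words[:i] + [words[i]] + words[i:])

def evolve(population, tickets, generations=20, top_k=4):
    # Evolutionary search: select top performers, breed replacements, repeat.
    for _ in range(generations):
        scored = sorted(population, key=lambda p: evaluate(p, tickets),
                        reverse=True)
        parents = scored[:top_k]
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(len(population) - top_k)]
        population = parents + children
    # Deploy the single best prompt found.
    return max(population, key=lambda p: evaluate(p, tickets))
```

Note that every generation re-evaluates the whole population, so the per-night call cost scales as generations × population size × subset size.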

Design B (Iterative LLM Search + Ensemble): Start with 30 seed prompts. Each nightly cycle: (i) evaluate all prompts on a fixed validation subset, (ii) prune to the top 8, (iii) use an LLM to expand each of the 8 into 5 new variants (40 new prompts), and repeat for 6 cycles. In production, run the top 3 prompts and aggregate their predicted category by majority vote.
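Design B's search-then-ensemble structure can be sketched as follows. Again a hedged illustration: `score` and `expand` are hypothetical stand-ins for the model calls described above, and `classify` is an assumed callable supplied by the caller.

```python
from collections import Counter

def score(prompt, tickets):
    # Stand-in for accuracy on the fixed validation subset
    # (one model call per (ticket, prompt) pair in practice).
    return (hash(prompt) % 100) / 100

def expand(prompt, n=5):
    # Stand-in for asking an LLM for n new variants (n model calls).
    return [f"{prompt} [variant {i}]" for i in range(n)]

def search(seeds, tickets, cycles=6, keep=8, fanout=5):
    # Iterative search: evaluate, prune to the top `keep`, expand, repeat.
    pool = list(seeds)
    for _ in range(cycles):
        pool.sort(key=lambda p: score(p, tickets), reverse=True)
        survivors = pool[:keep]
        pool = survivors + [v for p in survivors for v in expand(p, fanout)]
    pool.sort(key=lambda p: score(p, tickets), reverse=True)
    return pool[:3]  # top 3 prompts kept for the production ensemble

def classify_ensemble(ticket, top3, classify):
    # Majority vote across the three prompts (3 model calls per ticket).
    votes = Counter(classify(p, ticket) for p in top3)
    return votes.most_common(1)[0][0]
```

The vote is what distinguishes Design B at inference time: a single prompt's idiosyncratic error on a reworded ticket can be outvoted by the other two, at the cost of tripling per-ticket calls.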

Assume one model call is required per (ticket, prompt) evaluation, and one model call per prompt variant generated during expansion. Also assume majority voting requires running all 3 prompts at inference time.
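Under these assumptions, the daily call count of a configuration can be tallied mechanically. The helper below is a back-of-envelope counter, not part of the case: the subset size and daily ticket volume are illustrative parameters you must choose, and it assumes expansion happens in every cycle.

```python
def design_b_daily_calls(n_seeds=30, keep=8, fanout=5, cycles=6,
                         subset=40, tickets_per_day=0):
    """Model calls for one nightly Design B run plus production voting.

    `subset` (validation tickets per evaluation) and `tickets_per_day`
    are assumed inputs, not figures given in the case description.
    """
    calls = 0
    pool = n_seeds
    for _ in range(cycles):
        calls += pool * subset       # evaluate every prompt in the pool
        calls += keep * fanout       # expansion: one call per new variant
        pool = keep + keep * fanout  # survivors plus their new variants
    calls += tickets_per_day * 3     # majority vote: 3 calls per ticket
    return calls
```

For example, with a 40-ticket evaluation subset and no production traffic counted, the default configuration already consumes 11,040 calls per night, which is a useful sanity check against the 2,000-call daily budget when you answer part (c).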

Which design would you recommend and why? In your answer, explicitly (a) frame the problem as a search process by identifying the search space, search strategy, and performance estimation issues that matter here, (b) explain how ensembling interacts with prompt search to address stability under noisy labels and ticket wording variation, and (c) justify how your choice can be made to fit the daily call budget and latency constraint (you may propose a small modification to the chosen design, but keep it consistent with the design’s core idea).

Updated 2026-02-06

Ch.3 Prompting - Foundations of Large Language Models
