Case Study

Stabilizing an Instruction-Tuned Support Assistant When Synthetic Data Conflicts with Human Policy

You lead an internal ML team building an instruction-following assistant for your company’s customer support agents. You have a strong pre-trained base model and a small, high-quality seed set of 2,000 human-written instruction–response examples that reflect company policy (tone, escalation rules, and compliance language). To scale quickly, the team proposes: (1) using Self-Instruct to generate 300,000 new instructions, (2) using a smaller, cheaper “weak” model to generate the responses for those instructions, and then (3) instruction fine-tuning the strong model on the combined seed-plus-synthetic dataset.
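To make the proposed setup concrete, here is a minimal sketch of how the combined dataset might be assembled. The chat-style record schema, the source labels `human_seed` / `weak_synthetic`, and the file name `sft_mixed.jsonl` are illustrative assumptions, not the team’s actual format; provenance is tagged on every record because the question below turns on treating the two sources differently.

```python
import json

def to_sft_record(instruction: str, response: str, source: str) -> dict:
    """One supervised fine-tuning example with provenance metadata."""
    return {
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ],
        "source": source,  # assumed labels: "human_seed" or "weak_synthetic"
    }

def write_combined_dataset(seed, synthetic, path="sft_mixed.jsonl"):
    """Serialize both pools to one JSONL file, seed examples first.

    `seed` and `synthetic` are iterables of (instruction, response) pairs.
    """
    with open(path, "w", encoding="utf-8") as f:
        for inst, resp in seed:
            f.write(json.dumps(to_sft_record(inst, resp, "human_seed")) + "\n")
        for inst, resp in synthetic:
            f.write(json.dumps(to_sft_record(inst, resp, "weak_synthetic")) + "\n")
```

Keeping a provenance tag on every record is cheap, and it makes later selection, down-weighting, or ablation of the weak-model data straightforward.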

After a pilot fine-tune, offline evaluation shows mixed results: the model follows diverse instructions better, but it sometimes gives confidently wrong policy guidance and occasionally adopts an overly casual tone. A spot-check finds that many synthetic examples are plausible but subtly conflict with policy, and some are near-duplicates of one another.
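One concrete handle on the near-duplicate problem comes from the Self-Instruct paper itself, which adds a candidate instruction to the pool only if its ROUGE-L similarity to every instruction already in the pool is below 0.7. A minimal pure-Python sketch of that filter follows; the function names are illustrative, and a production pipeline would likely use a tested implementation such as the rouge-score package rather than this toy one.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for tok_a in a:
        prev = 0
        for j, tok_b in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if tok_a == tok_b else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure over whitespace-split, lowercased tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def keep_if_novel(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Admit a candidate instruction only if it is not a near-duplicate
    (ROUGE-L below the threshold against everything already pooled)."""
    return all(rouge_l(candidate, existing) < threshold for existing in pool)
```

Note that this filter only addresses redundancy; the subtler failure in the scenario, synthetic examples that are fluent but conflict with policy, requires a separate quality or policy-consistency check.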

As the decision-maker, what end-to-end data strategy would you implement for the next iteration (covering automatic data generation, selection/filtering, and how you would use weak-model-generated data in instruction fine-tuning) to improve instruction-following breadth without amplifying weak-model errors or drifting from policy? Justify your choices by explaining the key tradeoffs and failure modes you are addressing.

Tags: Ch.2 Generative Models - Foundations of Large Language Models · Foundations of Large Language Models · Foundations of Large Language Models Course · Computing Sciences · Ch.4 Alignment - Foundations of Large Language Models

Related