Evaluating Demonstration Sufficiency in a Prompt
A developer is creating a prompt to extract a 'Product Name' and 'Issue Type' from customer support tickets. The model is performing poorly on new tickets. Below is the prompt they are using.
Prompt Start
Extract the product name and issue type from the following support tickets.
Ticket: "My new QuantumLeap laptop won't turn on. I've tried plugging it in, but nothing happens."
Product Name: QuantumLeap laptop
Issue Type: Power Failure
Ticket: "The screen on my QuantumLeap laptop is flickering constantly. It's very distracting."
Product Name: QuantumLeap laptop
Issue Type: Display Malfunction
Ticket: "I can't connect my StellarSound headphones to my phone via Bluetooth."
Product Name:
Issue Type:
Prompt End
Evaluate the two demonstrations provided in the prompt. Explain why this set of examples is likely insufficient for the model to reliably handle a wide variety of support tickets.
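One way to ground the evaluation is to tabulate what the demonstrations actually cover. The sketch below is illustrative only: the helper names `coverage` and `build_prompt` and the dictionary layout are assumptions made for this example, not part of the exercise. The ticket text and labels are copied from the prompt above.

```python
from collections import Counter

# Demonstrations copied from the prompt in the exercise.
DEMONSTRATIONS = [
    {
        "ticket": "My new QuantumLeap laptop won't turn on. "
                  "I've tried plugging it in, but nothing happens.",
        "product": "QuantumLeap laptop",
        "issue": "Power Failure",
    },
    {
        "ticket": "The screen on my QuantumLeap laptop is flickering constantly. "
                  "It's very distracting.",
        "product": "QuantumLeap laptop",
        "issue": "Display Malfunction",
    },
]


def coverage(demos):
    """Count how often each product and each issue type appears (hypothetical helper)."""
    products = Counter(d["product"] for d in demos)
    issues = Counter(d["issue"] for d in demos)
    return products, issues


def build_prompt(instruction, demos, query):
    """Assemble a few-shot prompt in the same layout as the exercise (hypothetical helper)."""
    lines = [instruction]
    for d in demos:
        lines.append(f'Ticket: "{d["ticket"]}"')
        lines.append(f'Product Name: {d["product"]}')
        lines.append(f'Issue Type: {d["issue"]}')
    # The final ticket is left unlabeled for the model to complete.
    lines.append(f'Ticket: "{query}"')
    lines.append("Product Name:")
    lines.append("Issue Type:")
    return "\n".join(lines)


products, issues = coverage(DEMONSTRATIONS)
print("Distinct products:", len(products))
print("Distinct issue types:", len(issues))
```

Running the tally shows that both demonstrations share one product and cover only hardware faults, which is a useful starting point for the written answer.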
Tags
Ch.3 Prompting - Foundations of Large Language Models
Related
A developer is trying to get a language model to classify short movie reviews as 'Positive', 'Negative', or 'Neutral'. They test two different sets of instructions, shown below.
Instructions A:
Classify the following movie review.
Review: The plot was predictable and the acting was wooden.
Classification: Negative
Review: This film was an absolute masterpiece from start to finish.
Classification:
Instructions B:
Classify the following movie reviews.
Review: The plot was predictable and the acting was wooden.
Classification: Negative
Review: It wasn't a bad movie, but it wasn't particularly memorable either.
Classification: Neutral
Review: This film was an absolute masterpiece from start to finish.
Classification: Positive
Review: I have seen better, but it was an enjoyable way to spend an afternoon.
Classification:
Why are 'Instructions B' significantly more likely to lead to a correct and reliable classification for the final review compared to 'Instructions A'?
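The comparison can be made concrete by checking which of the three target labels each instruction set demonstrates. This is a minimal sketch: the names `LABELS`, `DEMO_LABELS_A`, `DEMO_LABELS_B`, and `missing_labels` are hypothetical and introduced only for illustration. The label lists are read off the completed examples in the two instruction sets above.

```python
# The three classes the developer wants the model to produce.
LABELS = {"Positive", "Negative", "Neutral"}

# Labels attached to the completed (demonstrated) reviews in each instruction set.
DEMO_LABELS_A = ["Negative"]
DEMO_LABELS_B = ["Negative", "Neutral", "Positive"]


def missing_labels(demo_labels, all_labels=LABELS):
    """Return the labels that no demonstration illustrates, sorted for stable output."""
    return sorted(set(all_labels) - set(demo_labels))


print("Undemonstrated in A:", missing_labels(DEMO_LABELS_A))
print("Undemonstrated in B:", missing_labels(DEMO_LABELS_B))
```

Label coverage is only one factor. The check says nothing about how clearly each demonstration conveys the output format or how close it is to the final review.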
Diagnosing Few-Shot Learning Failures