Learn Before
Evaluating Prompts for a Customer Support Chatbot
A company is optimizing a language model to summarize customer complaint emails for its support agents. The goal is a concise, one-sentence summary that accurately captures the core issue. The team tested two candidate prompts on a validation set of 100 emails: after generating a summary for each email with both prompts, human reviewers scored every summary as either 'Accurate' or 'Inaccurate'. Review the results below and determine which prompt is more effective, justifying your choice based on the fundamental principle of prompt evaluation.
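For concreteness, here is a minimal sketch (in Python) of the aggregate comparison the question calls for. The verdict lists below are hypothetical placeholders, not the study's actual results; the real inputs would be the reviewers' 100 labels per prompt.

```python
# Minimal sketch: compare two candidate prompts by aggregate accuracy
# over the validation set. The verdict lists are hypothetical
# placeholders, not the study's actual results.

def accuracy(verdicts):
    """Fraction of summaries the human reviewers scored 'Accurate'."""
    return sum(v == "Accurate" for v in verdicts) / len(verdicts)

# One reviewer verdict per validation email, per prompt
# (in the question there would be 100 entries in each list).
verdicts_prompt_a = ["Accurate", "Inaccurate", "Accurate", "Accurate"]
verdicts_prompt_b = ["Accurate", "Accurate", "Inaccurate", "Accurate"]

acc_a = accuracy(verdicts_prompt_a)
acc_b = accuracy(verdicts_prompt_b)
print(f"Prompt A: {acc_a:.0%}  Prompt B: {acc_b:.0%}")

# The fundamental principle: prefer the prompt with the higher
# accuracy aggregated across the whole validation set, not the
# one that happens to win on any individual email.
```

The deciding quantity is the accuracy aggregated over the entire validation set; which prompt happens to win on any single email is irrelevant to the comparison.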
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating Prompts with Pre-defined Metrics
Using Log-Likelihood to Evaluate Prompts
Pruning the Prompt Candidate Pool
A team is developing a system to classify customer feedback emails as 'Urgent' or 'Not Urgent'. They have created a set of 20 different instruction prompts to guide a language model in this classification task. To determine the best prompt, they select one sample 'Urgent' email and test each of the 20 prompts on it. They decide to choose the prompt that successfully leads the model to classify this single email as 'Urgent'. What is the most significant flaw in this evaluation strategy? (For contrast, a full-dataset evaluation loop is sketched after the next question.)
A developer has created a set of candidate prompts to make a language model summarize news articles. To find the best prompt, each one must be evaluated. Arrange the following actions into the correct logical sequence for evaluating a single candidate prompt across a dataset of articles.
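Both related questions above hinge on the same evaluation loop. Below is a minimal sketch of that loop, assuming hypothetical generate(prompt, article) and score(summary, reference) callables (invented names standing in for a model call and a quality metric such as human review): run the prompt over the whole dataset, score every output, aggregate the scores, and only then compare candidates.

```python
# Minimal sketch of the evaluation loop for one candidate prompt.
# `generate` and `score` are hypothetical stand-ins for a model call
# and a quality metric (human review, or an automatic score).

def evaluate_prompt(prompt, dataset, generate, score):
    """Average quality of `prompt` over every (article, reference) pair."""
    total = 0.0
    for article, reference in dataset:
        summary = generate(prompt, article)  # 1. generate an output
        total += score(summary, reference)   # 2. score that output
    return total / len(dataset)              # 3. aggregate over the set

def pick_best_prompt(prompts, dataset, generate, score):
    """Evaluate every candidate on the full dataset, then compare."""
    return max(prompts,
               key=lambda p: evaluate_prompt(p, dataset, generate, score))
```

The single-email strategy in the first question amounts to running this loop with len(dataset) == 1, so the comparison is dominated by the idiosyncrasies of one example and cannot reliably separate 20 candidate prompts.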