Multiple Choice

A team is developing a system to classify customer feedback emails as 'Urgent' or 'Not Urgent'. They have created a set of 20 different instruction prompts to guide a language model in this classification task. To determine the best prompt, they select one sample 'Urgent' email and test each of the 20 prompts on it. They decide to choose the prompt that successfully leads the model to classify this single email as 'Urgent'. What is the most significant flaw in this evaluation strategy?

0

1

Updated 2025-09-29

Contributors are:

Who are from:

Tags

Ch.3 Prompting - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science