Learn Before
Using Log-Likelihood to Evaluate Prompts
An alternative approach to measuring a prompt's effectiveness is to use the log-likelihood that a Large Language Model assigns to the output it generates. Because log-likelihood reflects how probable the model considers its own output, it can serve as a quantitative proxy for the prompt's quality without requiring human judgment.
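As a concrete illustration, the log-likelihood of a generated sequence is the sum of the log-probabilities of its tokens. The sketch below uses hypothetical per-token probabilities (not a real model call) to show how two prompts could be compared on this metric:

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum of log-probabilities of each generated token.

    token_probs: probabilities the model assigned to the tokens it
    actually emitted (values in (0, 1]). Higher (less negative) totals
    mean the model was more confident in its output.
    """
    return sum(math.log(p) for p in token_probs)

# Hypothetical token probabilities for outputs produced under two prompts.
probs_prompt_a = [0.9, 0.8, 0.85]   # model fairly confident
probs_prompt_b = [0.6, 0.5, 0.7]    # model less confident

ll_a = sequence_log_likelihood(probs_prompt_a)
ll_b = sequence_log_likelihood(probs_prompt_b)
# ll_a > ll_b, so by this metric prompt A would be preferred.
```

In practice these per-token probabilities come from the model itself (for example, the token log-probs that many LLM APIs can return alongside generated text); the comparison logic, however, stays the same.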
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Evaluating Prompts with Pre-defined Metrics
Pruning the Prompt Candidate Pool
A team is developing a system to classify customer feedback emails as 'Urgent' or 'Not Urgent'. They have created a set of 20 different instruction prompts to guide a language model in this classification task. To determine the best prompt, they select one sample 'Urgent' email and test each of the 20 prompts on it. They decide to choose the prompt that successfully leads the model to classify this single email as 'Urgent'. What is the most significant flaw in this evaluation strategy?
A developer has created a set of candidate prompts to make a language model summarize news articles. To find the best prompt, each one must be evaluated. Arrange the following actions into the correct logical sequence for evaluating a single candidate prompt across a dataset of articles.
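The evaluation loop that this question describes (score one candidate prompt on every article, then aggregate) can be sketched as follows. The `score_fn` here is a hypothetical stand-in for a real model call that returns the log-likelihood of the model's summary of an article under a given prompt:

```python
def average_log_likelihood(prompt, articles, score_fn):
    """Mean per-article log-likelihood for one candidate prompt.

    score_fn(prompt, article) is assumed to return the log-likelihood
    the model assigns to its summary of `article` under `prompt`.
    """
    scores = [score_fn(prompt, article) for article in articles]
    return sum(scores) / len(scores)

# Toy stand-in scorer, purely to make the ranking runnable:
# it is NOT a real model -- it just favors longer prompts.
def toy_score(prompt, article):
    return -1.0 / (1 + len(prompt))

articles = ["article one", "article two", "article three"]
candidates = [
    "Summarize:",
    "Summarize the following news article concisely:",
]

# Pick the candidate with the highest (least negative) average score.
best = max(
    candidates,
    key=lambda p: average_log_likelihood(p, articles, toy_score),
)
```

The key design point mirrors the question: a single example is noisy, so each candidate is scored across the whole dataset and candidates are compared on the aggregate.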
Evaluating Prompts for a Customer Support Chatbot
Learn After
A developer is testing two prompts for a text summarization task.
- Prompt 1 results in a summary with a very high log-likelihood score from the model, but human evaluators rate the summary as 'poor' because it misses key points.
- Prompt 2 results in a summary with a lower log-likelihood score, but human evaluators rate the summary as 'excellent' because it accurately captures all key points.
Based on this scenario, what is the most accurate conclusion about evaluating these prompts?
Prompt Evaluation for a Factual Task
Comparing Prompt Evaluation Strategies