Evaluating LLM Response Completeness
A developer is testing a large language model's ability to solve multi-step problems. They provide a prompt that ends with a phrase intended to elicit a detailed reasoning process (e.g., "Let's think step by step."). Below are two responses the model generated for the same prompt in different test runs. Which response is more effective for a user seeking a definitive solution, and what specific pitfall does Response A illustrate?
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Answer Extraction via Second-Round Prompting
A user gives a large language model the following prompt:
"Prompt: At a conference, there are 3 presentation rooms. Each room has 8 rows of chairs, and each row has 10 chairs. If all chairs are filled, how many people are attending the presentations? Let's think step by step."
The model provides the following output:
"Output: Okay, let's break this down. First, we need to find the total number of chairs in one room. There are 8 rows with 10 chairs each, so 8 rows * 10 chairs/row = 80 chairs per room. Then, we need to find the total number of chairs across all rooms. There are 3 rooms, so 3 rooms * 80 chairs/room = 240 chairs."
Analyze the model's output. What is the primary issue with its response?
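A minimal sketch of the second-round extraction pattern the related card's title refers to, assuming a generic text-completion interface (the generate function below is a hypothetical placeholder with canned replies so the sketch runs end to end, not any specific library's API):

```python
# Sketch of "answer extraction via second-round prompting" (zero-shot
# chain-of-thought style). `generate` is a hypothetical stand-in for a
# real LLM completion call; the canned replies keep the sketch runnable.

def generate(prompt: str) -> str:
    """Placeholder LLM call. Swap in a real client in practice."""
    if prompt.rstrip().endswith("Let's think step by step."):
        # Round-1 style reply: reasoning, but no explicit final answer.
        return ("First, 8 rows * 10 chairs/row = 80 chairs per room. "
                "Then, 3 rooms * 80 chairs/room = 240 chairs.")
    # Round-2 style reply: just the extracted answer.
    return "240"

question = (
    "At a conference, there are 3 presentation rooms. Each room has 8 rows "
    "of chairs, and each row has 10 chairs. If all chairs are filled, how "
    "many people are attending the presentations?"
)

# Round 1: elicit the reasoning chain.
first_prompt = f"Q: {question}\nA: Let's think step by step."
reasoning = generate(first_prompt)

# Round 2: append the reasoning and an extraction trigger so the model
# states the final answer in an easily parsed form.
second_prompt = (
    f"{first_prompt}\n{reasoning}\nTherefore, the answer (in numerals) is"
)
final_answer = generate(second_prompt)

print(final_answer)  # -> 240
```

The second round supplies exactly what the output above is missing: it computes 240 chairs but never states the attendee count as an explicit final answer, so a downstream parser (or a hurried reader) has nothing definitive to extract.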
Predicting Prompting Pitfalls