Evaluating LLM Arithmetic Inference
A user provides a language model with the word problem below. Evaluate the two generated responses. Which response more successfully translates the natural language problem into a correct sequence of mathematical operations? Justify your choice by identifying the specific logical flaw in the unsuccessful response.
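For context, the sketch below illustrates what a faithful translation of a word problem into an explicit sequence of arithmetic operations can look like. The two responses under evaluation are not reproduced in this note, so the example instead uses the bakery problem quoted in the Related section; the function and variable names are illustrative only, not part of the original question.

```python
# Illustrative sketch: translating the bakery word problem (quoted in the
# Related section of this note) into an explicit sequence of operations.
# All identifiers are made up for illustration.

def solve_bakery_problem() -> int:
    """'A bakery had 20 muffins. They sold 12 muffins and then baked
    3 dozen more. How many muffins does the bakery have now?'"""
    starting_muffins = 20
    after_sales = starting_muffins - 12   # "sold 12 muffins" -> 20 - 12 = 8
    baked = 3 * 12                        # "3 dozen" means 3 * 12 = 36, not 3
    return after_sales + baked            # 8 + 36 = 44


print(solve_bakery_problem())  # prints 44
```

The failure mode this note's related bakery question probes is exactly the step marked above: reading "3 dozen" as the literal number 3 instead of 3 × 12 = 36.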
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.3 Prompting - Foundations of Large Language Models
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Example of a Probability-Based Word Problem for LLMs
Example of a Multi-Step Arithmetic Word Problem (Swimming Pool)
Example of a Mathematical Reasoning Word Problem (Jessica's Apps)
Example of a Multi-Step Arithmetic Word Problem (Tom's Marbles)
A large language model was given the following word problem: 'A bakery had 20 muffins. They sold 12 muffins and then baked 3 dozen more. How many muffins does the bakery have now?' The model produced this response: 'First, we start with 20 muffins. They sold 12, so 20 - 12 = 8. Then they baked 3 more, so 8 + 3 = 11. The final answer is 11.' Which statement best analyzes the primary reasoning failure in the model's response?
Chain-of-Thought (CoT) Prompting
Example of a Multi-Step Arithmetic Word Problem (Jack's Apples)
Evaluating LLM Arithmetic Inference
A language model is tasked with solving arithmetic word problems. Below are common types of errors it might make when translating language into a sequence of mathematical operations. Match each error type with the scenario that best exemplifies it.