Evaluating LLM Reasoning Outputs
An engineer provides a large language model with the following math word problem: 'A grocery store had 250 apples. They sold 120 on Monday and then received a new shipment of 85 apples on Tuesday. How many apples do they have now?'
Output 1 (from a prompt asking for only the final answer): 'The store has 455 apples.'
Output 2 (from a prompt asking for a step-by-step explanation before the final answer): 'Let's break this down.
- Start with 250 apples.
- Sell 120 apples: 250 - 120 = 130 apples.
- Receive 85 new apples: 130 + 85 = 215 apples.
Therefore, the store has 215 apples.'
Critically evaluate the two outputs. Explain the likely reason for the significant difference in correctness by analyzing the process (or lack thereof) demonstrated in each. Based on your analysis, what general conclusion can you draw about eliciting more reliable solutions from language models for multi-step problems?
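As a sanity check on the two outputs, the problem's arithmetic can be traced in a few lines of plain Python (no model involved); this mirrors the steps Output 2 makes explicit:

```python
# Trace the apple problem step by step, as Output 2 does.
start = 250
after_monday = start - 120        # 120 apples sold on Monday
after_tuesday = after_monday + 85  # 85 apples received on Tuesday

print(after_tuesday)  # 215 — the correct answer

# Output 1's figure, 455, equals 250 + 120 + 85: every number in the
# problem added together, suggesting the single-shot answer never
# applied the subtraction for the Monday sale.
print(250 + 120 + 85)  # 455
```

Making each intermediate quantity explicit is exactly what the step-by-step prompt elicits from the model, and why it surfaces (or avoids) the sign error that Output 1 commits silently.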
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Novel Prompting Strategy
A researcher is trying to get a language model to solve a multi-step logic puzzle. They test two different prompts:
Prompt A: 'What is the solution to the following logic puzzle? [Puzzle text]'
Prompt B: 'Solve the following logic puzzle. First, break down the puzzle into individual facts and constraints. Next, reason through the implications of each fact step-by-step. Finally, state your conclusion and explain how you arrived at it. [Puzzle text]'
Which statement best analyzes why Prompt B is likely to yield a more accurate solution for this type of task?
Evaluating LLM Reasoning Outputs
Explicit Prompting for Extended Deliberation
Modifying Decoding for Longer Reasoning Paths
Multi-Stage Generation for Incremental Reasoning
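The contrast between Prompt A and Prompt B in the related item above can be sketched as plain string templates. The function names here are illustrative, not from any particular library; the point is only that the step-by-step variant adds explicit instructions to decompose, reason, and then conclude:

```python
def direct_prompt(puzzle: str) -> str:
    # Prompt A style: ask only for the final answer.
    return f"What is the solution to the following logic puzzle? {puzzle}"

def step_by_step_prompt(puzzle: str) -> str:
    # Prompt B style: require decomposition, stepwise reasoning,
    # and an explained conclusion before the answer.
    return (
        "Solve the following logic puzzle. "
        "First, break down the puzzle into individual facts and constraints. "
        "Next, reason through the implications of each fact step-by-step. "
        "Finally, state your conclusion and explain how you arrived at it. "
        f"{puzzle}"
    )

puzzle = "[Puzzle text]"
print(direct_prompt(puzzle))
print(step_by_step_prompt(puzzle))
```

Either string would then be sent to the model of choice; only the prompt text differs between the two conditions, which is what makes the comparison in the question a controlled one.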