Evaluating an LLM's Commonsense Reasoning Failure
An AI assistant is given the following prompt: 'I poured a glass of milk and heated it in the microwave for one minute. I then put a standard-sized ice cube into the hot milk. What will happen to the ice cube?' The AI responds: 'The ice cube will slowly cool the milk down, but it will likely remain mostly solid for a long time.' Evaluate the quality of the AI's response. In your evaluation, explain why this specific type of task is challenging for a language model, referencing the underlying principles of how these models process information.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A researcher is designing a test to specifically evaluate a Large Language Model's commonsense reasoning capabilities, which rely on implicit, real-world knowledge not explicitly stated in the prompt. Which of the following prompts would be the most effective for this specific purpose?
Analysis of a Commonsense Reasoning Failure