Multiple Choice

An AI's multi-step solution to a complex problem is evaluated by a separate model that classifies each step as either 'correct' or 'incorrect'. The final quality score for the entire solution is the total number of steps classified as 'correct'. What is a primary conceptual limitation of this evaluation approach?
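For reference, the scoring rule described in the question can be sketched as follows. This is only an illustrative sketch: the function name `solution_score` and the example label lists are hypothetical and do not come from the source, and the separate classifier model is assumed to have already produced one label per step.

```python
from typing import List


def solution_score(step_labels: List[str]) -> int:
    """Return the number of steps the classifier labeled 'correct'."""
    return sum(1 for label in step_labels if label == "correct")


# Hypothetical classifier outputs for two five-step solutions (assumed data).
early_error = ["incorrect", "correct", "correct", "correct", "correct"]
late_error = ["correct", "correct", "correct", "correct", "incorrect"]

print(solution_score(early_error))
print(solution_score(late_error))
```

Under this rule the score depends only on how many labels are 'correct', not on where they occur in the solution or on how many steps the solution contains.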

Updated 2025-10-05

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science