Designing a Robust LLM Evaluation
You are tasked with evaluating a new large language model's ability to understand a lengthy research paper. To guard against the risk of superficial understanding, describe one type of question you would ask the model. Explain why your proposed question is more likely to reveal true comprehension than one that simply asks the model to retrieve a specific fact stated in the paper.
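One way to make the contrast concrete is to sketch the two question types as items in a tiny evaluation harness. The sketch below is illustrative only: all names (`retrieval_item`, `synthesis_item`, `evaluate`) and the keyword-based graders are hypothetical stand-ins, not a real benchmark; in practice a synthesis question would be graded by a human or a stronger model, not keyword matching.

```python
# Hypothetical sketch: a fact-retrieval item vs. a synthesis item
# for testing long-document comprehension.

# Fact-retrieval question: answerable by locating a single sentence,
# so it only tests search/extraction, not understanding.
retrieval_item = {
    "question": "What sample size does Section 4 report?",
    "type": "retrieval",
    # Graded by exact match against one extractable fact.
    "grade": lambda answer: "1,024" in answer,
}

# Synthesis question: requires connecting claims from distant parts
# of the paper and explaining the relationship between them.
synthesis_item = {
    "question": (
        "The abstract claims the method is robust, but Section 6 "
        "reports a failure case. Reconcile these two statements."
    ),
    "type": "synthesis",
    # Keyword check is only a crude proxy for a human/LLM grader.
    "grade": lambda answer: all(
        term in answer.lower()
        for term in ("abstract", "failure", "because")
    ),
}

def evaluate(item, model_answer):
    """Return True if the model's answer passes this item's grader."""
    return item["grade"](model_answer)
```

A model with only superficial understanding can pass the retrieval item by pattern matching, but a vague restatement of the document fails the synthesis item because it never explains *why* the two claims coexist.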
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Diagnosing AI Performance in a Legal Context
An AI model is evaluated on its ability to understand a long, complex historical document. When asked, 'What year is mentioned in the third sentence of the 27th paragraph?', the model answers correctly. However, when asked, 'Based on the author's arguments in the first and final chapters, what is the author's primary critique of the events described?', the model provides a vague summary of the entire document without identifying the specific critique. Which of the following is the most likely explanation for this discrepancy in performance?