Evaluating a Language Model's True Understanding
A new language model achieves 99% accuracy on a test set where the examples are structured very similarly to its training data. However, it performs poorly on a different set of tasks that require it to execute novel combinations of familiar commands (e.g., it was trained on 'walk twice' and 'jump', but tested on 'walk after jumping'). Critique the claim that the 99% accuracy score represents a comprehensive understanding of the language. What does the model's poor performance on the second task set reveal about the nature of its learning and the limitations of the initial evaluation method?
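The gap described above can be made concrete with a minimal sketch (a hypothetical toy setup, not the actual model): a lookup table that memorizes its training pairs scores perfectly on a test set drawn from the same patterns, yet scores zero on novel compositions of the very same primitives. High in-distribution accuracy therefore cannot distinguish memorization from compositional understanding.

```python
# Toy SCAN-like language: a ground-truth interpreter plus a "model"
# that is nothing but a memorized lookup table of its training pairs.
PRIMITIVES = {"walk": ["WALK"], "jump": ["JUMP"], "look": ["LOOK"]}

def interpret(command):
    """Ground-truth interpreter for the toy command language."""
    words = command.split()
    if len(words) == 1:
        return PRIMITIVES[words[0]]
    if words[1] == "twice":
        return interpret(words[0]) * 2
    if words[1] == "after":  # "X after Y" -> perform Y, then X
        return interpret(words[2]) + interpret(words[0])
    raise ValueError(f"unknown command: {command}")

train = ["walk", "jump", "look", "walk twice", "look twice"]
iid_test = ["jump", "walk twice"]                        # same patterns as training
compositional_test = ["jump twice", "walk after jump"]   # novel combinations

memorizer = {c: interpret(c) for c in train}  # the "model": pure memorization

def accuracy(tests):
    correct = sum(memorizer.get(c) == interpret(c) for c in tests)
    return correct / len(tests)

print(accuracy(iid_test))            # 1.0 -- looks like full understanding
print(accuracy(compositional_test))  # 0.0 -- no compositional generalization
```

The 99% score in the question is analogous to the first number: it measures recall of familiar patterns, not the ability to recombine them.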
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
SCAN Benchmark
A research team is designing a task to evaluate a language model's ability to understand and execute novel combinations of familiar instructions. The model will be trained on a set of commands and their corresponding action sequences. Which of the following training and testing splits would provide the most rigorous and direct assessment of the model's compositional reasoning capabilities?
Diagnosing LLM Generalization Failure
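The split-design question under SCAN Benchmark above can be sketched in a few lines (using hypothetical primitives and modifiers): the most direct test of compositional reasoning holds one primitive out of all compositions during training, so that test-time success requires recombining it with modifiers learned from the other primitives.

```python
# Sketch of a SCAN-style compositional split: "jump" is seen in training
# only in isolation, while every composed "jump" command is held out.
primitives = ["walk", "look", "jump"]
modifiers = ["twice", "thrice"]

commands = primitives + [f"{p} {m}" for p in primitives for m in modifiers]

train = [c for c in commands if "jump" not in c or c == "jump"]
test = [c for c in commands if "jump" in c and c != "jump"]

print(train)  # "jump" appears only as a bare primitive
print(test)   # ['jump twice', 'jump thrice']
```

A random shuffle of `commands` into train/test would leak every pattern into training and measure only interpolation; the held-out-primitive split is what forces genuine recombination.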