An AI team fine-tunes a language model exclusively on a dataset for a single task: translating English legal documents into French. The model is then evaluated on two test sets.
- Test Set A: A new, unseen collection of English legal documents to be translated into French.
- Test Set B: A collection of diverse tasks, such as writing Python code, composing poetry, and summarizing news articles.
The model performs very well on Test Set A but poorly on Test Set B. What does this result reveal about the model's generalization abilities?
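The comparison in this scenario can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `summarize_generalization`, the set labels "A" and "B", and all score values are hypothetical placeholders, not results from any real model or an evaluation protocol named in the question.

```python
from statistics import mean


def summarize_generalization(scores_by_set):
    """Return the mean score per test set and the A-minus-B gap.

    scores_by_set maps a test-set label ("A" or "B") to a list of
    per-example scores in [0, 1]. The labels are assumptions made
    for this sketch.
    """
    means = {name: mean(scores) for name, scores in scores_by_set.items()}
    # A large positive gap means the model does well on new examples of the
    # fine-tuned task but much worse on the unrelated tasks.
    gap = means["A"] - means["B"]
    return means, gap


# Hypothetical placeholder scores, chosen only to illustrate the pattern.
example_scores = {
    "A": [1.0, 0.5],    # unseen legal documents, same task as fine-tuning
    "B": [0.25, 0.25],  # code, poetry, news summarization
}
means, gap = summarize_generalization(example_scores)
print(means, gap)
```

Reporting the two averages separately, rather than one pooled score, keeps performance on the fine-tuned task distinct from performance on the unrelated tasks.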
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
LLM Generalization Evaluation
Definition of Intra-Task Generalization
Formal Definition of Intra-Task Generalization
Analyzing LLM Performance
Formula for Generalization Across Tasks