Analyzing LLM Performance
An AI development team is evaluating its new instruction-tuned language model. The team observes two distinct behaviors:
- The model excels at summarizing scientific articles, even those from fields it wasn't explicitly trained on, as long as the instruction is "summarize this text."
- The model struggles when given a mix of instructions it has seen before, such as "translate this sentence," "write a poem," and "explain this concept," often confusing the required output formats.
Based on this scenario, identify which type of generalization the model demonstrates well and which type it lacks. Briefly justify your answer for each.
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
LLM Generalization Evaluation
Definition of Intra-Task Generalization
Formal Definition of Intra-Task Generalization
An AI team fine-tunes a language model exclusively on a dataset for a single task: translating English legal documents into French. The model is then evaluated on two test sets.
- Test Set A: A new, unseen collection of English legal documents to be translated into French.
- Test Set B: A collection of diverse tasks, such as writing Python code, composing poetry, and summarizing news articles.
The model performs very well on Test Set A but performs poorly on Test Set B. What does this evaluation reveal about the model's generalization abilities?
Formula for Generalization Across Tasks