Formula for Generalization Across Tasks
Generalization across tasks occurs when an instruction-fine-tuned model's average performance over all new instruction-input pairs is above a predefined threshold value, $\epsilon$. This condition is mathematically expressed as:

$$\frac{1}{|\mathcal{D}_{\text{new}}|} \sum_{(c,\, x) \in \mathcal{D}_{\text{new}}} \text{Score}(\hat{y}) > \epsilon$$

where $\mathcal{D}_{\text{new}}$ is the set of new instruction-input pairs, $(c, x)$ represents a specific new instruction and input from the set, and $\hat{y}$ is the corresponding model output.
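The condition above can be sketched as a small function. This is a minimal illustration, assuming per-pair performance has already been reduced to a score in [0, 1]; the function name and signature are illustrative, not from the source.

```python
def generalizes_across_tasks(scores, epsilon):
    """Check the inter-task generalization condition.

    scores: per-pair performance scores, one for each new
            instruction-input pair in the evaluation set (assumed in [0, 1]).
    epsilon: the predefined threshold the average must exceed.
    Returns True if the average score over the new set exceeds epsilon.
    """
    average = sum(scores) / len(scores)
    return average > epsilon
```

For example, `generalizes_across_tasks([0.9, 0.8, 0.7], 0.6)` returns `True`, since the average 0.8 exceeds the threshold 0.6.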

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
LLM Generalization Evaluation
Definition of Intra-Task Generalization
Formal Definition of Intra-Task Generalization
An AI team fine-tunes a language model exclusively on a dataset for a single task: translating English legal documents into French. The model is then evaluated on two test sets.
- Test Set A: A new, unseen collection of English legal documents to be translated into French.
- Test Set B: A collection of diverse tasks, such as writing Python code, composing poetry, and summarizing news articles.
The model performs very well on Test Set A but performs poorly on Test Set B. What does this evaluation reveal about the model's generalization abilities?
Analyzing LLM Performance
Formula for Generalization Across Tasks
Learn After
Evaluating Inter-Task Generalization
A language model's ability to generalize to new tasks is evaluated using a set of 5 new instruction-input pairs. The model's performance on each pair is scored on a scale of 0 to 1, yielding the scores: [0.9, 0.8, 0.3, 0.2, 0.7]. According to the formal condition for inter-task generalization, which is defined as the average performance over the new set exceeding a threshold (ε), does this model demonstrate this capability if the threshold is set at ε = 0.6?
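The averaging in this question can be checked directly. A minimal sketch, using the scores and threshold given above:

```python
# Scores for the 5 new instruction-input pairs, as listed in the question.
scores = [0.9, 0.8, 0.3, 0.2, 0.7]
epsilon = 0.6  # the threshold the average must exceed

# Average performance over the new set: (0.9+0.8+0.3+0.2+0.7)/5 ≈ 0.58.
average = sum(scores) / len(scores)

# The inter-task generalization condition requires average > epsilon.
print(average > epsilon)  # prints False: 0.58 does not exceed 0.6
```

Since the average of roughly 0.58 falls short of the 0.6 threshold, the model does not satisfy the formal condition despite its strong scores on three of the five pairs.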
A model's capability to perform well across a variety of different tasks is formally assessed using the condition $\frac{1}{|\mathcal{D}_{\text{new}}|} \sum_{(c,\, x) \in \mathcal{D}_{\text{new}}} \text{Score}(\hat{y}) > \epsilon$. In this expression, what is the most critical characteristic of the set of new instruction-input pairs, denoted by $\mathcal{D}_{\text{new}}$, for a valid evaluation?