Learn Before
Limitation of Test Loss in Predicting Downstream Performance
A significant caveat to scaling laws is that improvements in pre-training metrics, such as a lower test loss, do not guarantee better performance on every downstream task. Test loss is measured on the pre-training data distribution, which may differ substantially from the distribution of a given downstream task, and the final effectiveness of a Large Language Model is further shaped by subsequent adaptation processes, including fine-tuning and prompting.
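A minimal sketch of the caveat above, using made-up per-domain loss numbers (all values are illustrative, not measurements from any real model): a model can win on aggregate test loss yet lose on the one domain a downstream task actually depends on.

```python
# Hypothetical per-domain test losses for two pre-trained models.
# The domain names and numbers are assumptions chosen for illustration.
model_alpha = {"web_text": 1.5, "news": 1.6, "legal": 3.0}
model_beta = {"web_text": 2.4, "news": 2.5, "legal": 2.1}

def average_loss(losses):
    """Aggregate test loss over all domains (equal weighting)."""
    return sum(losses.values()) / len(losses)

# Alpha has the lower aggregate pre-training loss...
assert average_loss(model_alpha) < average_loss(model_beta)

# ...yet Beta is stronger on the legal domain that a downstream
# legal-summarization task would care about.
assert model_beta["legal"] < model_alpha["legal"]
```

The equal-weighted average stands in for an overall test loss; any aggregate that is dominated by the common domains shows the same effect.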
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Limitations of Monotonic Scaling Functions
Limitation of Test Loss in Predicting Downstream Performance
A research team develops a scaling function that accurately predicts their language model's performance on English text as they increase the model's parameter count. Confident in their findings, they use the same function to budget for a new, larger model intended for generating computer code. However, the final code-generation model performs significantly worse than the function predicted. Which statement best explains this outcome?
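A hedged sketch of the kind of scaling function the scenario above describes, assuming a simple power-law form L(N) = a * N**(-b) and invented (parameter count, loss) pairs on English text. The fit extrapolates loss for larger models on the *same* distribution it was measured on; nothing in it constrains loss or quality on a different distribution such as code.

```python
import math

# Illustrative (parameter count, English-text loss) pairs; both the
# values and the power-law form are assumptions for this sketch.
observations = [(1e8, 3.2), (1e9, 2.6), (1e10, 2.1)]

# Fit log L = log a - b * log N by ordinary least squares.
xs = [math.log(n) for n, _ in observations]
ys = [math.log(loss) for _, loss in observations]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
b = -sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
a = math.exp(y_mean + b * x_mean)

def predicted_loss(num_params):
    """English-text loss predicted by the fitted scaling function."""
    return a * num_params ** (-b)

# Extrapolating to 1e11 parameters predicts a lower English-text loss,
# but the fit carries no information about code-generation performance.
```

The mismatch in the scenario is exactly this: the fitted a and b summarize one data distribution, so budgeting a code model from them extrapolates across distributions, not just across scale.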
Evaluating a Compute Budgeting Strategy
A research lab has developed a scaling function that accurately predicts the performance of their specific 10-billion parameter language model on a large corpus of web text. This function can therefore be considered a reliable predictor for the performance of any other 10-billion parameter language model trained on a different large corpus of web text.
Learn After
Task-Specific Nature of Scaling Laws
A research lab pre-trains two language models, Model Alpha and Model Beta, on the same large text corpus. Model Alpha achieves a final test loss of 1.8, while Model Beta achieves a final test loss of 2.5. However, when both models are later adapted for a specialized legal document summarization task, Model Beta significantly outperforms Model Alpha. Which of the following statements provides the most likely explanation for this discrepancy?
Evaluating Model Selection Strategy
Model Selection for a Specialized Task
Interpreting Pre-training Metrics for Specialized Tasks