Learn Before
Comparing Single-Variable Scaling Functions
A research team has developed two separate mathematical functions to model their language model's performance. Function A describes the model's final loss solely as a function of the training dataset size (while holding model size constant). Function B describes the model's final loss solely as a function of the number of model parameters (while holding dataset size constant). Explain why relying on only one of these functions could lead to a suboptimal training strategy for a new, larger model.
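One way to see the pitfall: in a joint scaling law of the Chinchilla form, L(N, D) = E + A/N^α + B/D^β, the variable held fixed sets an irreducible floor that each single-variable curve flattens toward. The sketch below uses constants loosely based on published fits, but they are illustrative assumptions, not a definitive model:

```python
# Illustrative joint scaling law (Chinchilla-style form):
#   L(N, D) = E + A / N**alpha + B / D**beta
# The constants below are assumed for illustration, not fitted here.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params, n_tokens):
    """Joint loss as a function of parameter count N and token count D."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# "Function A": loss vs. dataset size at a FIXED (small) model size.
fixed_n = 1e8
for d in [1e9, 1e10, 1e11, 1e12]:
    print(f"N={fixed_n:.0e}, D={d:.0e} -> loss {loss(fixed_n, d):.3f}")

# The curve above flattens toward the floor E + A / fixed_n**alpha set by
# the frozen model size, so extrapolating it alone says nothing about how
# much a larger model would benefit from the same data budget:
print(f"floor at N={fixed_n:.0e}: {E + A / fixed_n ** alpha:.3f}")
print(f"N=1e10, D=1e12 -> loss {loss(1e10, 1e12):.3f}")
```

Under these assumptions, either single-variable curve hides the term contributed by the frozen variable, so neither one alone can guide the joint allocation of a compute budget between parameters and data.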
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Absence of a Universal Scaling Law
A research team is developing a new language model. They train several versions of the model, each with a different number of parameters, while keeping the training dataset size fixed. They then plot each version's final training loss against its parameter count. The resulting graph shows a consistent, downward-curving trend: as the number of parameters increases, the loss decreases, but each successive increase yields a smaller improvement. Based on this observation, what is the most accurate conclusion the team can draw?
Optimizing LLM Training Budget
Comparing Single-Variable Scaling Functions
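The diminishing-returns trend described in the related question above can be sketched with an assumed power law with an irreducible offset, L(N) = c + a·N^(-b). All constants here are hypothetical, chosen only to reproduce the qualitative shape of the plot:

```python
# Assumed single-variable scaling curve with an irreducible floor c:
#   L(N) = c + a * N**(-b)
# Constants are illustrative, not measured values.
c, a, b = 2.0, 120.0, 0.3

def loss(n_params):
    return c + a * n_params ** (-b)

prev = None
for n in [1e7, 1e8, 1e9, 1e10]:
    l = loss(n)
    note = "" if prev is None else f"  (improvement {prev - l:.3f})"
    print(f"N={n:.0e}: loss {l:.3f}{note}")
    prev = l
```

Each tenfold increase in parameters buys a smaller loss reduction, and the curve asymptotes to the floor c rather than zero; the fitted trend also holds only at the one dataset size tested, which is why it cannot be read as a universal scaling law.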