Diagnosing LLM Training Plateaus
A research team is training a large language model and observes that, after weeks of training, adding 50% more high-quality data to their training set produces only a negligible decrease in the model's final loss. Based on the combined power law L(N, D) = aN^b + cD^d + ε_∞, where N is the model size, D is the dataset size, the exponents b and d are negative, and ε_∞ is the irreducible loss, propose two distinct and plausible explanations for this phenomenon. For each explanation, identify the specific term(s) in the equation that would be the primary cause.
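To make the two candidate explanations concrete, the minimal Python sketch below evaluates the combined power law and reports the relative loss reduction from adding 50% more data. Every constant in it (a, b, c, d, eps_inf, and the N and D values) is a hypothetical number chosen only for illustration, not a fitted scaling-law coefficient.

    # Minimal numeric sketch; all constants below are hypothetical, for illustration only.

    def loss(N, D, a, b, c, d, eps_inf):
        """Combined power law L(N, D) = a*N**b + c*D**d + eps_inf, with b, d < 0."""
        return a * N ** b + c * D ** d + eps_inf

    def rel_drop_from_more_data(N, D, a, b, c, d, eps_inf, factor=1.5):
        """Fractional loss reduction from scaling the dataset by `factor`."""
        before = loss(N, D, a, b, c, d, eps_inf)
        after = loss(N, factor * D, a, b, c, d, eps_inf)
        return (before - after) / before

    N, D = 7e9, 1.4e12  # hypothetical parameter and token counts

    # Explanation 1: the model-size term a*N**b dominates the total loss, so
    # shrinking the already-small data term c*D**d barely moves it.
    print(rel_drop_from_more_data(N, D, a=5e3, b=-0.30, c=1e2, d=-0.28, eps_inf=0.1))

    # Explanation 2: the loss sits near the irreducible floor eps_inf, which no
    # amount of additional data can reduce.
    print(rel_drop_from_more_data(N, D, a=1e2, b=-0.30, c=1e2, d=-0.28, eps_inf=1.8))

In both runs the relative improvement from 50% more data is a fraction of a percent, matching the plateau described in the question; the difference lies in which term (aN^b or ε_∞) is doing the dominating.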
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Chinchilla Scaling Law
Optimizing LLM Training Strategy
A research team is working to improve a large language model and is using the combined power law, L(N, D) = aN^b + cD^d + ε_∞, to guide their efforts. Their analysis shows that the term aN^b, which depends on the model's parameter count (N), is currently the largest contributor to the total loss. The term cD^d, which depends on the dataset size (D), is comparatively small. To achieve the most significant reduction in loss with their limited resources, what should the team prioritize?
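For the related "Optimizing LLM Training Strategy" scenario, a similarly hedged sketch (hypothetical constants again, not fitted values) compares a 50% increase in parameters N against a 50% increase in data D when aN^b is the dominant term, which is why scaling the model is the higher-leverage choice there.

    # Minimal sketch with hypothetical constants: a*N**b dominates the loss.

    def loss(N, D, a=5e3, b=-0.30, c=1e2, d=-0.28, eps_inf=0.1):
        """Combined power law L(N, D) = a*N**b + c*D**d + eps_inf, with b, d < 0."""
        return a * N ** b + c * D ** d + eps_inf

    N, D = 7e9, 1.4e12  # hypothetical parameter and token counts
    base = loss(N, D)
    print("scale N by 1.5x:", (base - loss(1.5 * N, D)) / base)  # roughly a 10% drop
    print("scale D by 1.5x:", (base - loss(N, 1.5 * D)) / base)  # well under 1%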