Combined Power Law for LLM Loss with Model and Dataset Size
To account for multiple factors simultaneously, the loss of a Large Language Model can be modeled as a function of both the number of model parameters, N, and the size of the training dataset, D. This relationship is captured by a combined scaling law developed by Rosenfeld et al., which incorporates an irreducible error term, ε_∞, resulting in the formula: L(N,D) = aN^b + cD^d + ε_∞. In this equation, the terms aN^b and cD^d represent the independent contributions of model size and dataset size to the overall loss, each following a power law.
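A minimal numerical sketch (not from the source) of how this combined law behaves is shown below. The coefficient values a, b, c, d, and ε_∞ are hypothetical placeholders chosen only to make the curve qualitatively match the description above, i.e., loss falls as either N or D grows but never drops below the irreducible error.

```python
import numpy as np

# Sketch of the combined scaling law L(N, D) = a*N**b + c*D**d + eps_inf.
# All coefficient values are illustrative placeholders, not fitted values
# from the source: b and d are small negative exponents so each term
# shrinks as N (parameters) or D (dataset size) grows.

def combined_loss(N, D, a=11.5, b=-0.076, c=20.0, d=-0.095, eps_inf=1.69):
    """Loss as a function of parameter count N and dataset size D."""
    return a * N**b + c * D**d + eps_inf

# Increasing N shrinks only the aN^b term and increasing D shrinks only the
# cD^d term; the total loss approaches eps_inf but never crosses it.
for N, D in [(1e9, 1e11), (1e10, 1e11), (1e10, 1e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> L={combined_loss(N, D):.3f}")
```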

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Combined Power Law for LLM Loss with Model and Dataset Size
A research team observes that as they increase the computational resources (x) used to train a language model, the model's final loss (L) decreases. However, the loss curve begins to flatten out, suggesting it is approaching a minimum value greater than zero and will not improve further, regardless of additional resources. Given the relationship L(x) = ax^b + ε_∞, which component of the formula is responsible for this 'performance floor' phenomenon?
Comparing LLM Training Potential
Evaluating a Model Training Proposal
Combined Power Law for LLM Loss with Model and Dataset Size
A research team is deciding between two language model sizes. Model A will have 10 billion parameters, and Model B will have 100 billion parameters. According to the empirical relationship where performance loss (L) is a function of the number of parameters (N), following the power law L(N) = aN^b + ε_∞, which model should the team choose to achieve a lower final loss, and what is the justification?
Interpreting Model Scaling Effects
Interpreting the Model Scaling Formula
Visualizing Empirical Scaling Laws for LLM Loss
Power Law Fit for Test Loss vs. Model and Dataset Size
Combined Power Law for LLM Loss with Model and Dataset Size
Predicting LLM Performance Based on Dataset Size
A research team observes that their language model's loss (L) decreases as the training dataset size (D) increases, following the specific power law L(D) = (C/D)^α, where C is a large constant and the exponent α is a small positive number (e.g., 0.095). Based on this mathematical relationship, what is the most significant implication for the team as they consider scaling up their training data from an already very large starting point?
Calculating Loss Reduction from Increased Dataset Size
Power Law Fit for Test Loss vs. Model and Dataset Size
Learn After
Chinchilla Scaling Law
A research team is working to improve a large language model and is using the combined power law, L(N,D) = aN^b + cD^d + ε_∞, to guide their efforts. Their analysis shows that the term aN^b, which depends on the model's parameter count (N), is currently the largest contributor to the total loss. The term cD^d, which depends on the dataset size (D), is comparatively small. To achieve the most significant reduction in loss with their limited resources, what should the team prioritize?
Diagnosing LLM Training Plateaus
Optimizing LLM Training Strategy