Empirical Power Law for LLM Loss vs. Dataset Size (D)
An empirical fit formulated by Kaplan et al. (2020) demonstrates that a language model's test loss, denoted by L, decreases as a power-law function of the training dataset size, denoted by D. After an initial transient phase, this relationship is mathematically defined as L(D) = C * D^-α. In this equation, α is the scaling exponent and C is an empirically derived constant.
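As a quick numeric illustration of the fit above, the sketch below evaluates L(D) = C * D^-α for a few dataset sizes. The constant C and exponent α here are hypothetical placeholder values chosen for illustration, not fitted ones.

```python
# Sketch of the data-scaling law L(D) = C * D^-alpha.
# C and alpha are illustrative placeholders, not fitted values.
C = 5.0        # empirical constant (hypothetical)
alpha = 0.095  # scaling exponent (same order as values quoted in this card set)

def loss(D: float) -> float:
    """Predicted test loss for a training set of size D."""
    return C * D ** (-alpha)

for D in (1e6, 1e8, 1e10):
    print(f"D = {D:.0e}  ->  L = {loss(D):.3f}")
```

Note how each 100x increase in D multiplies the loss by the same factor, which is the defining property of a power law.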

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Empirical Power Law for LLM Loss vs. Model Size (N)
Two language models, Model A and Model B, have their performance (loss, L) modeled as a function of a resource x (where x > 1). The relationship for each is described by a power law equation:
- Model A: L(x) = 0.5 * x^-0.1
- Model B: L(x) = 0.5 * x^-0.2

Based on these equations, which statement correctly analyzes the models' improvement as more of the resource x is used?
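One way to make the comparison concrete is to evaluate both equations at a few values of x. This is a direct transcription of the two formulas above; the model with the more negative exponent drives loss down faster.

```python
def loss_a(x: float) -> float:
    # Model A: exponent -0.1 (shallower decay)
    return 0.5 * x ** -0.1

def loss_b(x: float) -> float:
    # Model B: exponent -0.2 (steeper decay)
    return 0.5 * x ** -0.2

for x in (10, 100, 1000):
    print(f"x = {x:>5}: A = {loss_a(x):.4f}, B = {loss_b(x):.4f}")
```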
Interpreting the Power Law Exponent
Model Selection Based on Performance Scaling
A machine learning team is training a series of language models. They systematically increase the size of the training dataset for each new model and record the final test loss. When they plot the test loss versus the dataset size on a graph where both axes use a logarithmic scale, they observe the points form a nearly straight, downward-sloping line. What is the most valid interpretation of this trend?
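The interpretation hinges on an algebraic fact: taking logarithms of L = C * D^-α gives log L = log C - α log D, a straight line with slope -α. The sketch below (with made-up values for C and α) generates exact power-law points and recovers the exponent from the slope of the log-log line.

```python
import math

# Synthetic (D, L) points from a known power law L = C * D^-alpha
# (C_true and alpha_true are arbitrary illustrative values).
C_true, alpha_true = 4.0, 0.09
data = [(10.0 ** k, C_true * (10.0 ** k) ** -alpha_true) for k in range(4, 9)]

# In log-log coordinates, log L = log C - alpha * log D is a straight line,
# so an ordinary least-squares slope recovers -alpha.
xs = [math.log(d) for d, _ in data]
ys = [math.log(l) for _, l in data]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"fitted exponent: {-slope:.3f}")  # recovers alpha_true = 0.090
```

Because the synthetic points lie exactly on the power law, the log-log points are exactly collinear; on real measurements the line is only approximately straight, which is precisely the observation described in the question.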
Three Phases of LLM Scaling with Dataset Size
Strategic Model Improvement
Interpreting Training Anomalies
Learn After
Combined Power Law for LLM Loss with Model and Dataset Size
Predicting LLM Performance Based on Dataset Size
A research team observes that their language model's loss (L) decreases as the training dataset size (D) increases, following the specific power law L(D) = C * D^-α, where C is a large constant and the exponent α is a small positive number (e.g., 0.095). Based on this mathematical relationship, what is the most significant implication for the team as they consider scaling up their training data from an already very large starting point?
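A helpful algebraic step for this question: under L(D) = C * D^-α, scaling the data by a factor k changes the loss by the ratio L(kD)/L(D) = k^-α, independent of both C and the starting point D. The sketch below evaluates that ratio for the exponent quoted in the question.

```python
alpha = 0.095  # exponent from the question above

def relative_loss(k: float) -> float:
    """L(k*D) / L(D) = k^-alpha: loss ratio after scaling the data by factor k."""
    return k ** -alpha

for k in (2, 10, 100):
    print(f"{k:>4}x data -> loss falls to {relative_loss(k):.1%} of its old value")
```

With an exponent this small, even a 100x increase in data leaves well over half of the original loss, which is the diminishing-returns implication the question is probing.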
Calculating Loss Reduction from Increased Dataset Size
Power Law Fit for Test Loss vs. Model and Dataset Size