Empirical Power Law for LLM Loss vs. Dataset Size (D)
An empirical fit formulated by Kaplan et al. (2020) demonstrates that a language model's test loss, denoted by L, decreases as a power-law function of the training dataset size, denoted by D. After an initial transient phase, this relationship is mathematically defined as L(D) = C * D^-α. In this equation, α is the scaling exponent and C is an empirically derived constant.
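As a quick numeric illustration of the fit above, the sketch below evaluates L(D) = C * D^-α for a few dataset sizes. The constant C and exponent α here are hypothetical placeholder values chosen for illustration, not fitted ones.

```python
# Sketch of the data-scaling law L(D) = C * D^-alpha.
# C and alpha are illustrative placeholders, not fitted values.
C = 5.0        # empirical constant (hypothetical)
alpha = 0.095  # scaling exponent (same order as values quoted in this card set)

def loss(D: float) -> float:
    """Predicted test loss for a training set of size D."""
    return C * D ** (-alpha)

for D in (1e6, 1e8, 1e10):
    print(f"D = {D:.0e}  ->  L = {loss(D):.3f}")
```

Note how each 100x increase in D multiplies the loss by the same factor, which is the defining property of a power law.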

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Empirical Power Law for LLM Loss vs. Model Size (N)
Two language models, Model A and Model B, have their performance (loss, L) modeled as a function of a resource x (where x > 1). The relationship for each is described by a power law equation:
- Model A: L(x) = 0.5 * x^-0.1
- Model B: L(x) = 0.5 * x^-0.2

Based on these equations, which statement correctly analyzes the models' improvement as more of the resource x is used?
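One way to make the comparison concrete is to evaluate both equations at a few values of x. This is a direct transcription of the two formulas above; the model with the more negative exponent drives loss down faster.

```python
def loss_a(x: float) -> float:
    # Model A: exponent -0.1 (shallower decay)
    return 0.5 * x ** -0.1

def loss_b(x: float) -> float:
    # Model B: exponent -0.2 (steeper decay)
    return 0.5 * x ** -0.2

for x in (10, 100, 1000):
    print(f"x = {x:>5}: A = {loss_a(x):.4f}, B = {loss_b(x):.4f}")
```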
Interpreting the Power Law Exponent
Model Selection Based on Performance Scaling
A machine learning team is training a series of language models. They systematically increase the size of the training dataset for each new model and record the final test loss. When they plot the test loss versus the dataset size on a graph where both axes use a logarithmic scale, they observe the points form a nearly straight, downward-sloping line. What is the most valid interpretation of this trend?
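The interpretation hinges on an algebraic fact: taking logarithms of L = C * D^-α gives log L = log C - α log D, a straight line with slope -α. The sketch below (with made-up values for C and α) generates exact power-law points and recovers the exponent from the slope of the log-log line.

```python
import math

# Synthetic (D, L) points from a known power law L = C * D^-alpha
# (C_true and alpha_true are arbitrary illustrative values).
C_true, alpha_true = 4.0, 0.09
data = [(10.0 ** k, C_true * (10.0 ** k) ** -alpha_true) for k in range(4, 9)]

# In log-log coordinates, log L = log C - alpha * log D is a straight line,
# so an ordinary least-squares slope recovers -alpha.
xs = [math.log(d) for d, _ in data]
ys = [math.log(l) for _, l in data]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"fitted exponent: {-slope:.3f}")  # recovers alpha_true = 0.090
```

Because the synthetic points lie exactly on the power law, the log-log points are exactly collinear; on real measurements the line is only approximately straight, which is precisely the observation described in the question.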
Three Phases of LLM Scaling with Dataset Size
Strategic Model Improvement
Interpreting Training Anomalies
Learn After
Combined Power Law for LLM Loss with Model and Dataset Size
Predicting LLM Performance Based on Dataset Size
A research team observes that their language model's loss (L) decreases as the training dataset size (D) increases, following the specific power law L(D) = C * D^-α, where C is a large constant and the exponent α is a small positive number (e.g., 0.095). Based on this mathematical relationship, what is the most significant implication for the team as they consider scaling up their training data from an already very large starting point?
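A helpful algebraic step for this question: under L(D) = C * D^-α, scaling the data by a factor k changes the loss by the ratio L(kD)/L(D) = k^-α, independent of both C and the starting point D. The sketch below evaluates that ratio for the exponent quoted in the question.

```python
alpha = 0.095  # exponent from the question above

def relative_loss(k: float) -> float:
    """L(k*D) / L(D) = k^-alpha: loss ratio after scaling the data by factor k."""
    return k ** -alpha

for k in (2, 10, 100):
    print(f"{k:>4}x data -> loss falls to {relative_loss(k):.1%} of its old value")
```

With an exponent this small, even a 100x increase in data leaves well over half of the original loss, which is the diminishing-returns implication the question is probing.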
Calculating Loss Reduction from Increased Dataset Size
Power Law Fit for Test Loss vs. Model and Dataset Size