Empirical Power Law for LLM Loss vs. Dataset Size (D)

An empirical fit by Kaplan et al. (2020) shows that a language model's test loss $\mathcal{L}$ decreases as a power-law function of the training dataset size $D$. After an initial transient phase, the relationship is $\mathcal{L}(D) = \left( \frac{D}{5.4 \times 10^{13}} \right)^{-0.095}$. Here $-0.095$ is the scaling exponent and $5.4 \times 10^{13}$ is an empirically fitted constant.
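The fit above can be evaluated directly. The sketch below is a minimal illustration: the constants come from the formula, but the function name and the chosen sample sizes are our own.

```python
# Sketch: evaluating the empirical data-scaling fit
# L(D) = (D / 5.4e13)^(-0.095) from Kaplan et al. (2020).
# The helper name and example values of D are illustrative, not from the source.

def loss_from_dataset_size(d_tokens: float,
                           d_c: float = 5.4e13,
                           alpha_d: float = 0.095) -> float:
    """Predicted test loss for a training set of d_tokens tokens."""
    return (d_tokens / d_c) ** (-alpha_d)

# A power law means doubling D always multiplies the loss by the same
# factor, 2**(-0.095) ~ 0.936, i.e. roughly a 6% reduction per doubling.
for d in (1e9, 1e10, 1e11):
    print(f"D = {d:.0e} tokens -> predicted loss {loss_from_dataset_size(d):.3f}")
```

Note that the constant-ratio behavior (each doubling of $D$ cuts the loss by the same fraction) is what makes the curve a straight line on a log-log plot.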

Updated 2026-05-02

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences