Combined Power Law for LLM Loss with Model and Dataset Size
To account for multiple factors simultaneously, the loss of a Large Language Model can be modeled as a function of both the number of model parameters, N, and the size of the training dataset, D. This relationship is captured by a combined scaling law developed by Rosenfeld et al., which incorporates an irreducible error term, ε_∞, resulting in the formula: L(N,D) = aN^b + cD^d + ε_∞. In this equation, the terms aN^b and cD^d represent the independent contributions of model size and dataset size to the overall loss, each following a power law.
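A minimal numerical sketch (not from the source) of how this combined law behaves is shown below. The coefficient values a, b, c, d, and ε_∞ are hypothetical placeholders chosen only to make the curve qualitatively match the description above, i.e., loss falls as either N or D grows but never drops below the irreducible error.

```python
import numpy as np

# Sketch of the combined scaling law L(N, D) = a*N**b + c*D**d + eps_inf.
# All coefficient values are illustrative placeholders, not fitted values
# from the source: b and d are small negative exponents so each term
# shrinks as N (parameters) or D (dataset size) grows.

def combined_loss(N, D, a=11.5, b=-0.076, c=20.0, d=-0.095, eps_inf=1.69):
    """Loss as a function of parameter count N and dataset size D."""
    return a * N**b + c * D**d + eps_inf

# Increasing N shrinks only the aN^b term and increasing D shrinks only the
# cD^d term; the total loss approaches eps_inf but never crosses it.
for N, D in [(1e9, 1e11), (1e10, 1e11), (1e10, 1e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> L={combined_loss(N, D):.3f}")
```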

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Combined Power Law for LLM Loss with Model and Dataset Size
A research team observes that as they increase the computational resources (x) used to train a language model, the model's final loss (L) decreases. However, the loss curve begins to flatten out, suggesting it is approaching a minimum value greater than zero and will not improve further, regardless of additional resources. Given the relationship L(x) = ax^b + ε_∞, which component of the formula is responsible for this 'performance floor' phenomenon?
Comparing LLM Training Potential
Evaluating a Model Training Proposal
Combined Power Law for LLM Loss with Model and Dataset Size
A research team is deciding between two language model sizes. Model A will have 10 billion parameters, and Model B will have 100 billion parameters. According to the empirical relationship where performance loss (L) is a function of the number of parameters (N), following the power law L(N) = aN^b + ε_∞, which model should the team choose to achieve a lower final loss, and what is the justification?
Interpreting Model Scaling Effects
Interpreting the Model Scaling Formula
Visualizing Empirical Scaling Laws for LLM Loss
Power Law Fit for Test Loss vs. Model and Dataset Size
Combined Power Law for LLM Loss with Model and Dataset Size
Predicting LLM Performance Based on Dataset Size
A research team observes that their language model's loss (L) decreases as the training dataset size (D) increases, following the specific power law L(D) = (C/D)^α, where C is a large constant and the exponent α is a small positive number (e.g., 0.095). Based on this mathematical relationship, what is the most significant implication for the team as they consider scaling up their training data from an already very large starting point?
Calculating Loss Reduction from Increased Dataset Size
Power Law Fit for Test Loss vs. Model and Dataset Size
Learn After
Chinchilla Scaling Law
A research team is working to improve a large language model and is using the combined power law, L(N,D) = aN^b + cD^d + ε_∞, to guide their efforts. Their analysis shows that the term aN^b, which depends on the model's parameter count (N), is currently the largest contributor to the total loss. The term cD^d, which depends on the dataset size (D), is comparatively small. To achieve the most significant reduction in loss with their limited resources, what should the team prioritize?
Diagnosing LLM Training Plateaus
Optimizing LLM Training Strategy