Learn Before
Empirical Power Law for LLM Loss vs. Model Size (N)
Research by Kaplan et al. (2020) demonstrated that, after an initial transient period, the performance of their language models improved as a power law in relation to the model size, denoted by N. This empirical scaling behavior for model size is expressed mathematically as L(N) = (N_c / N)^α_N, where L is the loss of the model, N_c ≈ 8.8 × 10^13 is a fitted constant (counted in non-embedding parameters), and α_N ≈ 0.076 is the fitted power-law exponent.
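A minimal sketch of this relationship in Python, using the constant values fitted by Kaplan et al. (2020); the function name is illustrative:

```python
# Sketch of the Kaplan et al. (2020) model-size power law:
#   L(N) = (N_c / N)^alpha_N
# The constants are the fits reported in the paper; N counts
# non-embedding parameters.

ALPHA_N = 0.076   # fitted exponent for model size
N_C = 8.8e13      # fitted constant, in non-embedding parameters

def loss_from_model_size(n_params: float) -> float:
    """Predicted loss for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

# Loss falls slowly but steadily as N grows by orders of magnitude.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}  ->  predicted loss = {loss_from_model_size(n):.3f}")
```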

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Empirical Power Law for LLM Loss vs. Dataset Size (D)
Two language models, Model A and Model B, have their performance (loss, L) modeled as a function of a resource x (where x > 1). The relationship for each is described by a power law equation:
- Model A: L(x) = 0.5 * x^-0.1
- Model B: L(x) = 0.5 * x^-0.2

Based on these equations, which statement correctly analyzes the models' improvement as more of the resource x is used? (A numeric sketch appears after this Related list.)
Interpreting the Power Law Exponent
Model Selection Based on Performance Scaling
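For the Model A vs. Model B question above, a quick numeric sketch using the two given equations shows why the larger-magnitude exponent improves faster:

```python
# Sketch comparing the two power laws from the question:
#   Model A: L(x) = 0.5 * x**-0.1
#   Model B: L(x) = 0.5 * x**-0.2

def loss_a(x: float) -> float:
    return 0.5 * x ** -0.1

def loss_b(x: float) -> float:
    return 0.5 * x ** -0.2

for x in (10, 100, 1000):
    print(f"x = {x:4d}: L_A = {loss_a(x):.4f}, L_B = {loss_b(x):.4f}")

# Each 10x increase in x multiplies Model A's loss by 10**-0.1 (~0.794,
# a ~21% drop) but Model B's by 10**-0.2 (~0.631, a ~37% drop), so
# Model B gains more from each additional unit of resource x.
```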
Learn After
Combined Power Law for LLM Loss with Model and Dataset Size
A research team is deciding between two language model sizes. Model A will have 10 billion parameters, and Model B will have 100 billion parameters. According to the empirical relationship where performance loss (L) is a function of the number of parameters (N), L(N) = (N_c / N)^α_N, which model should the team choose to achieve a lower final loss, and what is the justification? (A worked sketch appears at the end of this section.)
Interpreting Model Scaling Effects
Interpreting the Model Scaling Formula
Visualizing Empirical Scaling Laws for LLM Loss
Power Law Fit for Test Loss vs. Model and Dataset Size
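For the 10B vs. 100B parameter question above, a worked sketch assuming the Kaplan et al. (2020) exponent α_N ≈ 0.076; the constant N_c cancels in the ratio, so only the exponent matters:

```python
# Sketch for the 10B vs. 100B question, assuming L(N) = (N_c / N)**ALPHA_N
# with the Kaplan et al. (2020) fit ALPHA_N ~= 0.076.

ALPHA_N = 0.076

def loss_ratio(n_small: float, n_large: float) -> float:
    """Ratio L(n_large) / L(n_small) under the power law; N_c cancels."""
    return (n_small / n_large) ** ALPHA_N

ratio = loss_ratio(10e9, 100e9)
print(f"L(100B) / L(10B) = {ratio:.3f}")
# ~0.839: the 100B model's predicted loss is ~16% lower, so choose Model B.
```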