
Tandem Scaling of LLM Training Factors

Transformer language model performance follows a power law in each of three factors: model size (the number of parameters, excluding embedding layers), dataset size (the number of training tokens), and the amount of training compute. To get the best performance, all three must be scaled up in tandem, because growing one while holding the others fixed yields diminishing returns. Exactly how to scale them together remains an area of ongoing research.
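The tandem-scaling claim can be illustrated with a small numerical sketch. The additive loss form and every constant below (`n_c`, `d_c`, `alpha_n`, `alpha_d`) are hypothetical choices made for illustration only, not fitted values from any experiment. The compute estimate of roughly 6 FLOPs per parameter per token is a commonly used rule of thumb and is likewise an assumption here.

```python
import math


def power_law_loss(x, x_c, alpha):
    """Single-factor power law: loss falls as (x_c / x) ** alpha."""
    return (x_c / x) ** alpha


def combined_loss(n_params, n_tokens,
                  n_c=1e13, d_c=1e13, alpha_n=0.08, alpha_d=0.10):
    """Hypothetical additive model: one power-law term per factor.

    All default constants are illustrative assumptions, not measured values.
    """
    return (power_law_loss(n_params, n_c, alpha_n)
            + power_law_loss(n_tokens, d_c, alpha_d))


def approx_compute(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    base_n, base_d = 1e8, 1e9  # hypothetical starting model and dataset size
    for k in range(4):
        scale = 10 ** k
        # Grow only the model while the dataset stays fixed.
        only_n = combined_loss(base_n * scale, base_d)
        # Grow the model and the dataset together.
        both = combined_loss(base_n * scale, base_d * scale)
        flops = approx_compute(base_n * scale, base_d * scale)
        print(f"scale x{scale:<5} only-N loss={only_n:.3f}  "
              f"tandem loss={both:.3f}  tandem compute={flops:.1e} FLOPs")
```

In this toy model, the loss when only the parameter count grows can never drop below the fixed dataset term, whereas scaling both factors keeps lowering it. On a log-log plot each single-factor term is a straight line with slope `-alpha`, which is the signature of a power law.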


Updated 2026-05-15


Tags

D2L

Dive into Deep Learning @ D2L