Learn Before
Tandem Scaling of LLM Training Factors
Transformer language model performance follows a power law in each of three key factors: model size (number of parameters, excluding embedding layers), dataset size (number of training tokens), and the amount of training compute. For optimal performance, all three factors must be scaled up in tandem; otherwise the factor left behind becomes the bottleneck. Exactly how to increase them together remains an area of ongoing research.
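To make the tandem-scaling idea concrete, here is a minimal sketch in Python. It assumes an additive power-law form for the loss in model size and dataset size; the function names and the constants A, ALPHA, B, and BETA are hypothetical values chosen for illustration, not fitted results, and compute is not modeled separately.

```python
import math

# Hypothetical constants, chosen only for illustration (not fitted values).
A, ALPHA = 400.0, 0.34   # model-size term
B, BETA = 410.0, 0.28    # dataset-size term

def loss(n_params: float, n_tokens: float) -> float:
    """Illustrative loss: a sum of one power-law term per factor."""
    return A / n_params ** ALPHA + B / n_tokens ** BETA

def loglog_slope(f, x0: float, x1: float) -> float:
    """Slope of f between x0 and x1 when plotted on log-log axes."""
    return (math.log(f(x1)) - math.log(f(x0))) / (math.log(x1) - math.log(x0))

# A single power-law term is a straight line on log-log axes with slope -ALPHA.
slope = loglog_slope(lambda n: A / n ** ALPHA, 1e6, 1e9)

# Compare growing only the model (data fixed) with growing both together.
base = loss(1e8, 1e9)
model_only = loss(1e10, 1e9)   # 100x parameters, same tokens
tandem = loss(1e10, 1e11)      # 100x parameters and 100x tokens

print(f"slope={slope:.3f}")
print(f"base={base:.3f} model_only={model_only:.3f} tandem={tandem:.3f}")
```

Under these assumed constants, enlarging only the model leaves the dataset term untouched, so the loss stops improving once that term dominates; growing both factors together reduces both terms.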

Tags
D2L
Dive into Deep Learning @ D2L
Related
A research team is training a large language model and has a fixed, non-negotiable computational budget. Their goal is to achieve the lowest possible final loss. Based on the established principles that govern the relationship between computation, model size, data size, and performance, which of the following strategies represents the most efficient use of their budget?
Evaluating an LLM Training Strategy
Analyzing Deviations from LLM Scaling Behavior
Continued Effectiveness of Scaling up Training in NLP
Power-Law Curve of Performance Scaling
Scaling Laws Across LLM Development Stages
Tandem Scaling of LLM Training Factors
Sample Efficiency of Large Language Models
Performance Scaling in GPT-3