Two-Stage Sequence Length Pre-training
To improve training efficiency and reduce the computational cost of pre-training large models on massive datasets, a common practice is to split pre-training into two stages by sequence length. First, the model is trained on relatively short text sequences for the majority of the training steps; it then continues training on full-length sequences for the remaining steps. Because the cost of self-attention grows quadratically with sequence length, spending most steps on short sequences substantially reduces the overall time and compute required, while the final full-length stage adapts the model to long contexts.
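As a rough sketch of how such a schedule might be wired into a training loop, the Python below switches the maximum sequence length once a step threshold is crossed. All names and values here (STAGE1_STEPS, SHORT_SEQ_LEN, FULL_SEQ_LEN, and the train_step callback) are illustrative assumptions, not details taken from the source.

```python
# A minimal sketch of a two-stage sequence-length schedule.
# The constants below are hypothetical, chosen only for illustration.

STAGE1_STEPS = 90_000   # steps spent on short sequences (assumed)
TOTAL_STEPS = 100_000   # total pre-training steps (assumed)
SHORT_SEQ_LEN = 2_048   # stage-1 sequence length (assumed)
FULL_SEQ_LEN = 8_192    # stage-2 full sequence length (assumed)


def seq_len_for_step(step: int) -> int:
    """Return the maximum sequence length scheduled for a given step."""
    return SHORT_SEQ_LEN if step < STAGE1_STEPS else FULL_SEQ_LEN


def pretrain(train_step, total_steps: int = TOTAL_STEPS) -> None:
    """Run the training loop, re-chunking the data stream whenever the
    scheduled sequence length changes between the two stages."""
    current_len = None
    for step in range(total_steps):
        scheduled_len = seq_len_for_step(step)
        if scheduled_len != current_len:
            current_len = scheduled_len
            print(f"step {step}: training on sequences of length {current_len}")
        train_step(step, current_len)  # one optimizer update on one batch


if __name__ == "__main__":
    # Toy run with a no-op update, just to show the single stage switch.
    pretrain(lambda step, seq_len: None)
```

In practice the switch point and both lengths are tuned per model; the key design choice is that re-chunking the corpus at the stage boundary lets most optimizer updates run on the cheaper short sequences.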
Updated 2026-04-17
Tags
Foundations of Large Language Models
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences