
Two-Stage Sequence Length Pre-training

To improve training efficiency and mitigate the computational cost of pre-training large models on massive datasets, a common practice is to split pre-training into two stages by sequence length. First, the model is trained on relatively short sequences for the large majority of training steps; then it continues training on full-length sequences for the remaining steps. Because the cost of self-attention grows quadratically with sequence length, spending most steps on short sequences substantially reduces the overall compute and wall-clock time required for pre-training.
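A minimal sketch of such a schedule in Python. The specific lengths (2,048 and 8,192 tokens) and the 90/10 split between stages are illustrative assumptions, not values from the text:

```python
def seq_len_for_step(step, total_steps,
                     short_len=2048, full_len=8192,
                     short_frac=0.9):
    """Return the training sequence length for a given step.

    Stage 1 (first `short_frac` of steps): short sequences.
    Stage 2 (remaining steps): full-length sequences.
    All parameter values here are illustrative assumptions.
    """
    stage1_steps = int(total_steps * short_frac)
    return short_len if step < stage1_steps else full_len

# Example: 100k total steps, 90% of them at the short length.
schedule = [seq_len_for_step(s, 100_000)
            for s in (0, 50_000, 89_999, 90_000, 99_999)]
```

In a real training loop, the batch loader would re-pack documents to the length returned for the current step; the abrupt stage boundary could also be replaced by a gradual length warm-up.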

Updated 2026-04-17

Tags

Foundations of Large Language Models

Ch.1 Pre-training - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences