Multiple Choice
A machine learning engineer is training a large transformer-based model on a text dataset where documents can be up to 512 tokens long. To accelerate the process, they first train the model for 90% of the total training steps using only the first 128 tokens of each document. For the final 10% of the steps, they switch to using the full 512-token sequences. What is the primary reason this two-stage training strategy is effective?
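For concreteness, here is a minimal sketch of how such a two-stage schedule might be implemented, assuming PyTorch-style tensors and a Hugging Face-style model interface; `model`, `get_batch`, `optimizer`, and `total_steps` are hypothetical placeholders, not names from the question:

def train(model, get_batch, optimizer, total_steps):
    # First 90% of steps use 128-token prefixes; the final 10% use full 512-token sequences.
    switch_step = int(0.9 * total_steps)
    for step in range(total_steps):
        seq_len = 128 if step < switch_step else 512
        input_ids, labels = get_batch()     # hypothetical helper returning a full 512-token batch
        input_ids = input_ids[:, :seq_len]  # keep only the first seq_len tokens of each document
        labels = labels[:, :seq_len]
        loss = model(input_ids=input_ids, labels=labels).loss  # HF-style forward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Note that because self-attention cost grows quadratically with sequence length, the short-sequence stage accounts for most of the compute savings in a schedule like this.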
Updated 2025-09-26
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science