Multiple Choice

A machine learning engineer is training a large transformer-based model on a text dataset where documents can be up to 512 tokens long. To accelerate the process, they first train the model for 90% of the total training steps using only the first 128 tokens of each document. For the final 10% of the steps, they switch to using the full 512-token sequences. What is the primary reason this two-stage training strategy is effective?
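For context on why such a schedule helps: self-attention compute grows roughly quadratically with sequence length, so a 128-token step costs about (512/128)^2 = 16x less attention compute than a 512-token step, and most of what the model learns transfers from short contexts; the brief full-length phase then lets it adapt to longer sequences and positions. Below is a minimal sketch of this kind of two-stage schedule, assuming a PyTorch-style training loop and an HF-style model whose forward pass returns a loss; the helper names (`truncate_batch`, `two_stage_train`) are hypothetical, not from the source.

```python
import itertools

def truncate_batch(input_ids, max_len):
    """Keep only the first max_len tokens of each sequence in the batch."""
    return input_ids[:, :max_len]

def two_stage_train(model, optimizer, data_loader, total_steps,
                    short_len=128, full_len=512, short_frac=0.9):
    """Train on 128-token prefixes for the first 90% of steps, then
    switch to the full 512-token sequences for the final 10%.

    Self-attention cost grows roughly quadratically with sequence
    length, so each short-phase step costs about (512/128)**2 = 16x
    less attention compute than a full-length step.
    """
    switch_step = int(total_steps * short_frac)
    batches = itertools.cycle(data_loader)  # batches of (batch_size, 512) token ids
    for step in range(total_steps):
        max_len = short_len if step < switch_step else full_len
        input_ids = truncate_batch(next(batches), max_len)
        loss = model(input_ids).loss  # assumes an HF-style model that returns a loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Note that only the data pipeline changes between the two stages; the model and optimizer state carry over, which is what makes the cheap short-sequence phase pay off.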



Tags: Ch.1 Pre-training - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Analysis in Bloom's Taxonomy