Analyzing a Language Model's Pre-training Log
You are monitoring the pre-training of a large language model. The table below shows the validation loss measured at various training checkpoints; a lower validation loss indicates better performance. Based on these data, identify the optimal checkpoint to use for the final model and explain the reasoning behind your choice, referencing the phenomenon that this strategy is designed to prevent.
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A machine learning engineer is pre-training a large language model. They monitor the model's performance on a separate, unseen dataset after every 10,000 training steps. They observe the following trend:
- Steps 1-100,000: Performance steadily improves.
- Step 110,000: The model achieves its best performance so far.
- Steps 120,000-150,000: Performance consistently worsens with each measurement.
Based on this observation, what is the most appropriate immediate action to ensure the best possible model is obtained from this training run?
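The appropriate action in this scenario is early stopping: halt training once validation performance has failed to improve for several consecutive evaluations, and restore the checkpoint that achieved the best validation score. A minimal sketch of that rule (the `patience` parameter and the numbers are assumptions for illustration, chosen to mirror the trend in the question):

```python
def early_stop_step(history, patience=3):
    """Return the step with the best (lowest) validation loss, stopping
    the scan once the loss has not improved for `patience` consecutive
    evaluations."""
    best_step, best_loss, evals_since_best = None, float("inf"), 0
    for step, loss in history:
        if loss < best_loss:
            best_step, best_loss, evals_since_best = step, loss, 0
        else:
            evals_since_best += 1
            if evals_since_best >= patience:
                break  # trigger early stopping
    return best_step

# Mirrors the observed trend: improvement up to step 110,000,
# then consistent worsening thereafter.
history = [(100_000, 2.40), (110_000, 2.30), (120_000, 2.35),
           (130_000, 2.40), (140_000, 2.50), (150_000, 2.60)]
print(early_stop_step(history))  # 110000
```

Stopping here prevents wasted compute and, more importantly, avoids shipping an overfit model: the degradation after step 110,000 indicates the model is memorizing training data at the expense of generalization.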
Rationale for Early Stopping in Model Pre-training