Learn Before
A team is training a large-scale model on a distributed cluster of several thousand machines, and the run is expected to last several weeks. To maximize raw computational speed, they implement no mechanism for handling machine failures during training. Which of the following is the most critical risk of this design choice?
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Distributed System Configuration
Trade-offs in Fault Tolerance Checkpointing